skimindex Documentation

⚠ Project under active development
Results produced by the pipeline should not be considered correct or reliable at this stage.

Directory Structure — layout of runtime data directories (raw data, processed outputs, indexes, stamps) and development directories (source code, Docker build, scripts)
Configuration File Format — specification of config/skimindex.toml: configuration sections, source sections, role sections, processing sections, and data sections
Data Model — runtime concepts: sources (ncbi, genbank, internal), datasets (Dataset, to_data()), and roles — how they bind together and drive the pipeline
Processing Model — how processing sections work: atomic vs composite, Data abstraction, input chaining, output resolution
Entry Point — skimindex.sh usage, global options, built-in subcommands, container runtime detection, and bind-mount mechanism
Tools — third-party tools bundled in the container image: OBITools4, kmindex, ntCard, SRA Toolkit, NCBI Datasets CLI, IBM Aspera, with versions, binary locations, and bibliographic references
Pipeline Commands — reference for download, decontam, and validate subcommands

Quick Start

Copy config/skimindex.toml and edit it for your datasets.
Adjust host paths in [local_directories] to match your storage layout.
Add one data section per dataset, with source and role keys.
Run skimindex.sh to launch the pipeline inside the container.
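
Before launching the pipeline, it can help to confirm that every data section carries the two keys named in step 3. The sketch below is only an illustration: it assumes the data sections have already been loaded into a plain dictionary keyed by section name, and the function name `check_data_sections` and the example section names are hypothetical. The authoritative schema is the one described in the Configuration File Format page.

```python
# Keys that step 3 of the Quick Start asks for in each data section.
REQUIRED_KEYS = ("source", "role")


def check_data_sections(data_sections):
    """Return a list of problems found; an empty list means all sections pass.

    `data_sections` is assumed to map a section name to a dict of its keys.
    This layout is an assumption made for the example, not the real schema.
    """
    problems = []
    for name, section in data_sections.items():
        for key in REQUIRED_KEYS:
            if key not in section:
                problems.append(f"{name}: missing '{key}' key")
    return problems


# Hypothetical sections: the second one lacks a role on purpose.
example = {
    "human": {"source": "ncbi", "role": "decontamination"},
    "plants": {"source": "genbank"},
}

for problem in check_data_sections(example):
    print(problem)
```

A check like this only catches missing keys; it does not verify that the source or role named in a section is actually defined elsewhere in the configuration.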

Pipeline Flow

Raw data                 Processed data              Indexes
────────────────         ──────────────────          ──────────
[source]/                [processed_data]/           [indexes]/
  {data}/                  {role}/                     {role}/
    {species}/               {data}/                     …
      {accession}/             {species}/
        *.gbff.gz                {accession}/
                                   {processing}/
                                     *.fasta.gz

Sources: ncbi, genbank, internal
Roles: decontamination, genomes, genome_skims
Processing steps: split, kmercount, buildindex
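
The directory nesting shown in the diagram can be expressed as two small path builders. This is a minimal sketch that mirrors the diagram only: the function names, the root directories, and the species and accession values are hypothetical placeholders, and the pipeline itself may resolve paths differently.

```python
from pathlib import PurePosixPath


def raw_dir(source_root, data, species, accession):
    """Raw data layout: [source]/{data}/{species}/{accession}/"""
    return PurePosixPath(source_root, data, species, accession)


def processed_dir(processed_root, role, data, species, accession, processing):
    """Processed layout: [processed_data]/{role}/{data}/{species}/{accession}/{processing}/"""
    return PurePosixPath(processed_root, role, data, species, accession, processing)


# Placeholder values chosen for illustration only.
print(raw_dir("/raw/genbank", "plants", "Betula_nana", "ACC0001"))
print(processed_dir("/processed", "genomes", "plants", "Betula_nana", "ACC0001", "split"))
```

Note that the role appears only on the processed and index sides of the diagram: raw data is organised by source, and the role is applied once the data enters processing.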
