skimindex Documentation
Results produced by the pipeline should not be considered correct or reliable at this stage.
Contents
- Directory Structure: layout of runtime data directories (raw data, processed outputs, indexes, stamps) and development directories (source code, Docker build, scripts)
- Configuration File Format: specification of config/skimindex.toml (configuration sections, source sections, role sections, processing sections, and data sections)
- Data Model: runtime concepts, namely sources (ncbi, genbank, internal), datasets (Dataset, to_data()), and roles, and how they bind together and drive the pipeline
- Tools: third-party tools bundled in the container image (OBITools4, kmindex, ntCard, SRA Toolkit, NCBI Datasets CLI, IBM Aspera), with versions, binary locations, and bibliographic references
- Pipeline Commands: reference for the download, decontam, and validate subcommands; skimindex.sh usage, global options, built-in subcommands, container runtime detection, and bind-mount mechanism; how processing sections work (atomic vs composite, Data abstraction, input chaining, output resolution)
Quick Start
- Copy config/skimindex.toml and edit it for your datasets.
- Adjust host paths in [local_directories] to match your storage layout.
- Add one data section per dataset, with source and role keys.
- Run skimindex.sh to launch the pipeline inside the container.
Pipeline Flow
Raw data                Processed data            Indexes
────────────────        ──────────────────        ──────────
[source]/               [processed_data]/         [indexes]/
  {data}/                 {role}/                   {role}/
    {species}/              {data}/                   …
      {accession}/            {species}/
        *.gbff.gz               {accession}/
                                  {processing}/
                                    *.fasta.gz
Sources: ncbi, genbank, internal
Roles: decontamination, genomes, genome_skims
Processing steps: split, kmercount, buildindex
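To make the layout concrete, the sketch below composes the raw and processed directory paths for one accession. It reads the diagram as nested directories, one level per line, which is an interpretation of the diagram and not a statement about the skimindex code. The functions raw_path and processed_path, the root directories, and the dataset, species, and accession names are hypothetical placeholders.

```python
from pathlib import PurePosixPath


def raw_path(source_root: str, data: str, species: str, accession: str) -> PurePosixPath:
    """[source]/{data}/{species}/{accession}"""
    return PurePosixPath(source_root) / data / species / accession


def processed_path(
    processed_root: str,
    role: str,
    data: str,
    species: str,
    accession: str,
    processing: str,
) -> PurePosixPath:
    """[processed_data]/{role}/{data}/{species}/{accession}/{processing}"""
    return PurePosixPath(processed_root) / role / data / species / accession / processing


# Placeholder values only; real roots come from the configuration file.
raw = raw_path("/raw/genbank", "plants", "species_a", "ACC0001")
out = processed_path("/processed", "genomes", "plants", "species_a", "ACC0001", "split")

print(raw)  # directory expected to hold *.gbff.gz files
print(out)  # directory expected to hold *.fasta.gz files
```

Note that the role appears only on the processed side: under this reading, raw data is organised by source, while processed data and indexes are organised by role.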