skimindex Documentation
⚠ Project under active development
Results produced by the pipeline should not be considered correct or reliable at this stage.
Contents
- Directory Structure: layout of runtime data directories (raw data, processed outputs, indexes, stamps) and development directories (source code, Docker build, scripts)
- Configuration File Format: specification of config/skimindex.toml, covering configuration sections, source sections, role sections, processing sections, and data sections
- Data Model: runtime concepts, namely sources (ncbi, genbank, internal), datasets (Dataset, to_data()), and roles, and how they bind together and drive the pipeline
- Processing Model: how processing sections work (atomic vs composite, Data abstraction, input chaining, output resolution)
Quick Start
- Copy config/skimindex.toml and edit it for your datasets.
- Adjust host paths in [local_directories] to match your storage layout.
- Add one data section per dataset, with source and role keys.
- Run skimindex.sh to launch the pipeline inside the container.
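Before launching the pipeline, it can help to check that each data section carries the two required keys. The sketch below is not part of skimindex: the function name check_data_section and the dataset names are hypothetical, and it assumes that the sources and roles listed under Pipeline Flow are the only accepted values. It works on a plain dictionary, such as one obtained by parsing the TOML file.

```python
# Hypothetical pre-flight check; not part of the skimindex code base.
# Assumption: the sources and roles named in this page are the only valid values.
VALID_SOURCES = {"ncbi", "genbank", "internal"}
VALID_ROLES = {"decontamination", "genomes", "genome_skims"}


def check_data_section(name, section):
    """Return a list of problems found in one data section (empty if none)."""
    problems = []
    for key, allowed in (("source", VALID_SOURCES), ("role", VALID_ROLES)):
        value = section.get(key)
        if value is None:
            problems.append(f"{name}: missing '{key}' key")
        elif value not in allowed:
            problems.append(f"{name}: unknown {key} '{value}'")
    return problems


# Illustrative dataset names only.
print(check_data_section("human", {"source": "ncbi", "role": "decontamination"}))
print(check_data_section("plants", {"source": "ncbi"}))
```

The first call prints an empty list; the second reports the missing role key.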
Pipeline Flow
Raw data                Processed data              Indexes
────────────────        ──────────────────          ──────────
[source]/               [processed_data]/           [indexes]/
  {data}/                 {role}/                     {role}/
    {species}/              {data}/                     …
      {accession}/            {species}/
        *.gbff.gz               {accession}/
                                  {processing}/
                                    *.fasta.gz
Sources: ncbi, genbank, internal
Roles: decontamination, genomes, genome_skims
Processing steps: split, kmercount
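The layout above can be read as a path-building rule. The following sketch composes the raw and processed directories for one accession. It assumes that the bracketed names ([source], [processed_data]) resolve to root directories taken from the configuration. The helper names, root paths, and accession ID are hypothetical and chosen only for illustration.

```python
from pathlib import PurePosixPath


def raw_dir(source_root, data, species, accession):
    """[source]/{data}/{species}/{accession}/ : where *.gbff.gz files sit."""
    return PurePosixPath(source_root) / data / species / accession


def processed_dir(processed_root, role, data, species, accession, processing):
    """[processed_data]/{role}/{data}/{species}/{accession}/{processing}/."""
    return (PurePosixPath(processed_root) / role / data / species
            / accession / processing)


# Hypothetical roots and identifiers, for illustration only.
raw = raw_dir("/data/genbank", "plants", "Arabidopsis_thaliana", "ACC0001")
out = processed_dir("/data/processed", "genomes", "plants",
                    "Arabidopsis_thaliana", "ACC0001", "split")
print(raw)
print(out)
```

Note that the role appears only on the processed side: in the diagram, raw data is organised by source, while processed data and indexes are organised by role.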