Skip to content

Pipeline Commands

The following subcommands are built into the container image and available via skimindex <command>. For global options, runtime detection, and bind-mount configuration see Entry Point.

All pipeline commands share a common set of flags inherited from the SkimCommand base class:

Flag Description
--list Print available sections (datasets or divisions) as multi-line CSV with header and exit.
--dry-run Show what would be done without executing anything.
--help Show command help and exit.

download — Download raw data

Download GenBank flat-file releases and NCBI reference genome assemblies.

skimindex download                        # download everything
skimindex download genbank [options]      # GenBank flat-files only
skimindex download ncbi    [options]      # NCBI genome assemblies only

download genbank

Downloads GenBank flat-file divisions declared in [source.genbank].

Option Description
--division DIV Process a single GenBank division (e.g. pln, bct).
--status Show download status without downloading.
--list Print available divisions as multi-line CSV and exit.
--dry-run Show what would be downloaded without executing.

--list output example:

division
bct
pln

download ncbi

Downloads NCBI reference genome assemblies declared as source = "ncbi" data sections.

Option Description
--dataset NAME Process a single NCBI dataset (e.g. human, plants).
--taxon TAXON Query assemblies for a taxon and display results (no download).
--one-per species\|genus Keep only one assembly per species or genus.
--assembly-level LEVEL Filter by assembly level (e.g. complete, chromosome).
--assembly-source SOURCE Filter by assembly source (refseq, genbank).
--assembly-version VERSION Filter by assembly version (e.g. latest).
--reference Filter to reference assemblies only.
--status Show download status without downloading.
--list Print available datasets as multi-line CSV and exit.
--dry-run Show what would be downloaded without executing.

--list output example:

dataset
human
plants

decontam — Build decontamination k-mer index

Builds a kmindex presence/absence index from a set of reference sequences representing biological contaminants (e.g. human, fungi, bacteria). The resulting index is used at query time to identify and remove reads whose k-mer content overlaps with the reference.

The pipeline runs per dataset declared with role = "decontamination" in skimindex.toml and proceeds in three sequential steps:

skimindex decontam                        # run full pipeline (prepare + count + index)
skimindex decontam prepare [options]      # step 1 — fragment reference sequences
skimindex decontam count   [options]      # step 2 — count k-mers (ntcard)
skimindex decontam index   [options]      # step 3 — build kmindex sub-indexes
skimindex decontam clean   [options]      # step 4 — decontaminate genome/skim datasets

Each step is stamped independently. Re-running decontam skips any dataset whose output is already stamped.

decontam prepare

Transforms each reference sequence file into a set of fixed-length overlapping fragments, filters out degenerate sequences (N-only), and distributes the result into a fixed number of FASTA batch files.

Processing chain (configured in [processing.prepare_decontam]):

  1. split — slides a window of size bp with overlap bp between consecutive fragments over each input sequence (default: size = 200, overlap = 28). The overlap value should be set to \(k - 1\) where \(k\) is the k-mer size used for indexing, so that every k-mer spanning a fragment boundary is represented.
  2. filter_n_only — discards fragments composed entirely of N bases.
  3. distribute — partitions fragments into batches FASTA files (default: 20) to allow parallel processing downstream.

Output: processed_data/decontamination/{dataset}/parts/

Option Description
--dataset NAME Process a single decontamination dataset (e.g. human, fungi).
--list Print available datasets as multi-line CSV and exit.
--dry-run Show what would be processed without executing.

--list output example:

dataset
human
fungi
bacteria
plants

decontam count

Estimates the k-mer frequency spectrum of the prepared fragments using ntCard.

ntCard produces a histogram file {prefix}_k{K}.hist per dataset with the frequency distribution of all k-mers of length \(k\) (kmer_size, default 29). The histogram includes two aggregate statistics used by the next step:

  • F0 — estimated total k-mer count (including duplicates across fragments).
  • F1 — estimated number of distinct k-mers.

Output: processed_data/decontamination/{dataset}/kmercount/

Option Description
--dataset NAME Process a single decontamination dataset.
--list Print available datasets as multi-line CSV and exit.
--dry-run Show what would be processed without executing.

decontam index

Builds a kmindex Bloom filter sub-index for each decontamination dataset and registers it in the global meta-index at indexes/decontamination/. Called once per dataset (not once per assembly).

FOF generation

Before calling kmindex build, a kmtricks file-of-files (FOF) is generated automatically:

  • The parts/ directory of the dataset is scanned recursively for assembly subdirectories.
  • One sample is created per assembly subdirectory of parts/.
  • Sample names are derived from the relative path of the subdirectory: re.sub(r"[^A-Za-z0-9_-]", "_", "--".join(rel.parts)).
  • register_as = first component of the dataset's subdir path (e.g. "Human", "Plants").
  • The FOF file is named {register_as}.fof and stored in processed_data/decontamination/{dataset}/kmindex/.

FOF example for Plants (62 assemblies, one sample per assembly):

Spermatophyta--GCF_000001735_4 : /path/parts/Spermatophyta/GCF_000001735.4/file1.fa.gz ; /path/parts/Spermatophyta/GCF_000001735.4/file2.fa.gz
Spermatophyta--GCF_000002775_5 : /path/parts/Spermatophyta/GCF_000002775.5/file1.fa.gz
...

Bloom filter sizing

kmindex uses a single-hash presence/absence Bloom filter (--bloom-size flag). For a query requiring \(z\) consecutive k-mer hits to call a positive match, the false positive probability is:

\[P_{fpr} = \left(\frac{n}{n + m}\right)^z\]

where \(n\) is the number of k-mers inserted into the filter (taken from the maximum F1 across all samples in the ntcard histograms — not the sum) and \(m\) is the number of cells in the filter (bloom_size). Inverting for \(m\):

\[m = \left\lceil n \cdot \left(p^{-1/z} - 1\right) \right\rceil\]

The bloom_size value is computed automatically from the ntcard histograms produced by decontam count, using the fpr and zvalue parameters declared in [processing.build_index_decontam] (defaults: \(p = 10^{-3}\), \(z = 3\)). hard_min = 1 is used because reference sequences are not sequencing data — every k-mer occurrence counts.

Index structure

Each dataset produces a sub-index registered by name in the global meta-index:

indexes/decontamination/
├── Human/       ← sub-index run dir (created by kmindex)
├── Fungi/
├── Bacteria/
└── Plants/

The root indexes/decontamination/ is the meta-index consumed by kmindex query or kmindex query2 at decontamination time. The stamp for each dataset is written to processed_data/decontamination/{dataset}/kmindex/.

Option Description
--dataset NAME Process a single decontamination dataset.
--list Print available datasets as multi-line CSV and exit.
--dry-run Show what would be processed without executing.

decontam clean

Decontaminates genome and genome-skim datasets against the contamination k-mer index built by the three steps above. Operates on all datasets with role = "genomes" or role = "genome_skims".

For each individual, the pipeline:

  1. Annotates each read file with kmquery.lua via obiscript, outputting compressed FASTA with kmindex_score and contamination annotations.
  2. Filters with obigrep using the predicate annotations.contamination && annotations.kmindex_score >= min_score. Paired-end reads are filtered with --paired-mode or (a pair is discarded when either read is contaminated).

Output per individual (always gzip-compressed FASTA):

processed_data/{role_dir}/{dataset}/{species}/{individual}/cleaned/
    {sample}_good.fasta.gz       — sequences that pass (not contaminated)
    {sample}_discarded.fasta.gz  — sequences that fail (contaminated)
skimindex decontam clean                             # clean all datasets
skimindex decontam clean --dataset species_15x       # one dataset
skimindex decontam clean --dataset species_15x --species Betula_nana
skimindex decontam clean --dataset species_15x --species Betula_nana --individual IGA-24-34
Option Description
--dataset NAME Process a single genomes/genome_skims dataset.
--species NAME Restrict to a single species (as listed by --list).
--individual NAME Restrict to a single individual (requires --species).
--list Print available dataset/species/individual combinations as multi-line CSV and exit.
--dry-run Show what would be processed without executing.

The --list output is a multi-column CSV:

dataset,species,individual
species_15x,Betula_exilis,IGA-24-33
species_15x,Betula_nana,IGA-24-34
vaccinium_skims,Vaccinium_myrtillus,SAMEA9098823
…

validate — Validate configuration

Loads config/skimindex.toml, runs all validation rules, and reports errors.

skimindex validate [--config PATH]
Option Description
--config PATH Path to the config file (default: /config/skimindex.toml).

Exits with code 0 if valid, 1 if errors are found.


User subcommands

Scripts placed in the project's usercmd/ directory are automatically available as subcommands without rebuilding the image.

skimindex <name> [options]

Each script runs inside the container with:

  • usercmd/ bind-mounted to /usercmd/
  • SKIMINDEX_SCRIPTS_DIR=/app/scripts set, so scripts can source the skimindex libraries:
source "${SKIMINDEX_SCRIPTS_DIR}/__skimindex.sh"   # log + config + stamping

The first non-empty comment line of the script (after the shebang and any separator) is used as its description in skimindex --help.