skimindex Data Model — Sources, Datasets, and Roles

Overview

Three concepts organise all data in skimindex:

Concept	Config prefix	What it represents
Source	`[source.X]`	Where raw sequences come from (download origin)
Dataset	`[data.X]`	A named collection of sequences with a specific purpose
Role	`[role.X]`	A category of use — groups datasets and drives processing

Every dataset binds exactly one source and one role:

[source.ncbi] ──────────────────┐
                                 ▼
                          [data.human]  ──▶  [role.decontamination]
                                 ▲
[source.genbank] ───────────────┘  (e.g. [data.fungi])

The TOML syntax for all three is specified in Configuration File Format.
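
The one-source, one-role binding can be checked mechanically. The sketch below is illustrative only: it assumes the configuration has already been parsed into nested dictionaries mirroring the `[source.X]`, `[data.X]` and `[role.X]` tables, and `check_bindings` is a hypothetical helper, not part of skimindex.

```python
# Source names documented on this page.
KNOWN_SOURCES = {"ncbi", "genbank", "internal", "sra"}


def check_bindings(config):
    """Return a list of problems; an empty list means every dataset is bound.

    Hypothetical helper: `config` is assumed to be a dict shaped like the
    parsed TOML, e.g. {"data": {...}, "role": {...}}.
    """
    problems = []
    roles = config.get("role", {})
    for name, dataset in config.get("data", {}).items():
        source = dataset.get("source")
        role = dataset.get("role")
        if source not in KNOWN_SOURCES:
            problems.append(f"data.{name}: unknown or missing source {source!r}")
        if role not in roles:
            problems.append(f"data.{name}: unknown or missing role {role!r}")
    return problems


good = {
    "role": {"decontamination": {"directory": "decontamination"}},
    "data": {
        "human": {"source": "ncbi", "role": "decontamination"},
        "fungi": {"source": "genbank", "role": "decontamination"},
    },
}
# Same config, but with a dataset that forgets its role.
bad = {
    "role": good["role"],
    "data": {"human": {"source": "ncbi"}},
}

good_problems = check_bindings(good)
bad_problems = check_bindings(bad)
```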

Sources

A source represents an origin for raw sequence data. There are four:

`ncbi`

Data downloaded via the NCBI Datasets CLI (datasets download genome). Files are organised per genome assembly:

{source.ncbi.directory}/
  {data.directory}/
    {Species}--{accession}.gbff.gz     ← level-0 layout
    {Species}/
      {accession}.gbff.gz              ← level-1 layout
      {accession}/
        *.gbff.gz                      ← level-2 layout (multi-file assemblies)

The canonical filename convention uses -- as the separator between species and accession (see Directory Structure).
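
As an illustration of the `--` convention, a level-0 filename can be split back into its species and accession parts. This is a minimal sketch: `split_level0_name` is a hypothetical helper, and the underscore-joined species name and the accession in the usage example are example values only.

```python
from pathlib import PurePosixPath
from typing import Optional, Tuple

SUFFIX = ".gbff.gz"
SEPARATOR = "--"


def split_level0_name(filename: str) -> Optional[Tuple[str, str]]:
    """Split '{Species}--{accession}.gbff.gz' into (species, accession).

    Returns None when the name does not follow the level-0 convention.
    Hypothetical helper, not part of skimindex.
    """
    name = PurePosixPath(filename).name
    if not name.endswith(SUFFIX):
        return None
    stem = name[: -len(SUFFIX)]
    # Split on the last separator so the accession is always the final field.
    species, sep, accession = stem.rpartition(SEPARATOR)
    if not sep or not species or not accession:
        return None
    return species, accession


parsed = split_level0_name("Homo_sapiens--GCF_000001405.40.gbff.gz")
```

Files that lack the separator (for example the level-1 layout, where the species is carried by the directory name) return `None` here and would need the parent directory to recover the species.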

`genbank`

Data downloaded as GenBank flat-file releases from the NCBI FTP site. Files are not split per dataset — the entire release is shared:

{source.genbank.directory}/
  Release_{N}.0/
    fasta/
      bct/    ← bacterial sequences
      pln/    ← plant/fungi sequences
      …
    taxonomy/
      ncbi_taxonomy.tgz

A dataset with source = "genbank" filters sequences out of these flat-files at pipeline time using obigrep with:

- a division filter (divisions key), which selects flat-files by GenBank division
- an optional taxid filter (taxid key), which keeps only sequences in a given taxon

`internal`

In-house sequencing data. No download step — files must be placed manually in the source directory. The expected layout is a level-2 species-organised tree:

{source.internal.directory}/{data.directory}/
  {Species}/
    {individual}/
      {sample}_R1_001.fastq.gz   ← Illumina paired-end R1
      {sample}_R2_001.fastq.gz   ← Illumina paired-end R2
      {sample}.fasta.gz           ← or single-end / Nanopore (any OBITools format)

Pairing is detected automatically: two files in the same individual directory are considered a pair if and only if their names have the same length and differ in exactly one character, which is '1' in one name and '2' in the other. Nanopore or single-file samples (no R1/R2 pattern) are yielded as single-end Data items.
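
The pairing rule can be sketched in a few lines. This is a literal reading of the rule as stated above, not the skimindex implementation; `is_pair` is a hypothetical name.

```python
def is_pair(name_a: str, name_b: str) -> bool:
    """Return True if two filenames look like an R1/R2 pair.

    Literal reading of the rule: same length, exactly one differing
    position, and that position holds '1' in one name and '2' in the other.
    """
    if len(name_a) != len(name_b):
        return False
    # Collect the character pairs at every position where the names differ.
    diffs = [(a, b) for a, b in zip(name_a, name_b) if a != b]
    return len(diffs) == 1 and set(diffs[0]) == {"1", "2"}


illumina = is_pair("sample_R1_001.fastq.gz", "sample_R2_001.fastq.gz")
sra_style = is_pair("run_1.fastq.gz", "run_2.fastq.gz")
unrelated = is_pair("sampleA.fasta.gz", "sampleB.fasta.gz")
```

Under this literal reading, two single-end files such as `sample1.fasta.gz` and `sample2.fasta.gz` in the same directory would also be reported as a pair; whether skimindex applies additional guards is not stated here.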

`sra`

Raw reads downloaded from NCBI SRA via fasterq-dump. EBI (ERR/ERS/SAMEA) and DDBJ accessions are mirrored at NCBI and supported transparently.

The download layout follows the same level-2 species-organised structure:

{source.sra.directory}/{data.directory}/
  {organism}/
    {biosample}/
      {run}_1.fastq.gz
      {run}_2.fastq.gz

Pairing uses the same '1'/'2' detection as internal data.

Datasets

A dataset ([data.X]) is a named collection of sequences that:

- comes from one source (where to find the raw files)
- serves one role (what the pipeline will do with them)

Runtime representation — `Dataset`

At runtime, a dataset is represented by a Dataset object (skimindex.datasets.Dataset). Its primary method is:

ds.to_data() -> Iterator[Data]

to_data() yields one or more Data objects ready to enter a processing pipeline. How many are yielded, and of which kind, depends on the source:

Source	Yields	One `Data` per
`ncbi`	`FILES`	genome assembly file (recursive scan of download dir)
`genbank`	`FILES` or `STREAM`	division (after optional taxid filtering)
`internal`	`FILES`	individual — paired-end (R1+R2) or single-end, any OBITools format
`sra`	`FILES`	run — paired-end (R1+R2) downloaded by fasterq-dump

Each Data object carries a subdir — the relative path from the processed data root up to (but not including) the processing output directory. This encodes the full dataset/species/accession context so that the pipeline can compute output paths without additional parameters.
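
To make the role of subdir concrete, the sketch below joins a processed data root, a subdir and a processing directory into an output path. It only illustrates the path arithmetic: `output_dir` is a hypothetical function, and the root, species and accession values are example placeholders.

```python
from pathlib import PurePosixPath


def output_dir(processed_root: str, subdir: str, processing_dir: str) -> PurePosixPath:
    """Compose {root}/{subdir}/{processing} (hypothetical helper).

    `subdir` is assumed to hold the {role}/{data}/{species}/{accession}
    segments, as described for Data.subdir.
    """
    return PurePosixPath(processed_root) / subdir / processing_dir


path = output_dir(
    "processed_data",
    "decontamination/human/Homo_sapiens/GCF_000001405.40",  # example subdir
    "parts",
)
```

Because the subdir already carries the dataset, species and accession context, the only extra inputs are the root and the name of the processing output directory.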

Each Data object also carries a per_species: bool flag (default True). When False, processing steps switch to per-part mode: k-mer counting and index building operate on individual fragment files rather than the whole parts/ directory at once. This flag is set automatically by the dataset source:

Source	`per_species`	Indexing mode
`ncbi`	`True`	one sample per assembly (`parts/` dir)
`genbank` (`by_species = false`)	`False`	one sample per part file (`frg_N`)
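
The difference between the two indexing modes can be sketched as a choice of inputs: the whole parts/ directory when per_species is True, or each fragment file on its own when it is False. This is an assumption-laden illustration; `indexing_inputs` is a hypothetical helper and the `frg_1`, `frg_2` file names are placeholders following the `frg_N` pattern from the table.

```python
import tempfile
from pathlib import Path
from typing import List


def indexing_inputs(parts_dir: Path, per_species: bool) -> List[Path]:
    """Return the units handed to k-mer counting / index building.

    Hypothetical helper: one unit (the directory) in per-species mode,
    one unit per fragment file in per-part mode.
    """
    if per_species:
        return [parts_dir]
    return sorted(p for p in parts_dir.iterdir() if p.is_file())


# Build a throwaway parts/ directory with two placeholder fragments.
with tempfile.TemporaryDirectory() as tmp:
    parts = Path(tmp) / "parts"
    parts.mkdir()
    for name in ("frg_1", "frg_2"):
        (parts / name).touch()
    whole = indexing_inputs(parts, per_species=True)
    per_part = [p.name for p in indexing_inputs(parts, per_species=False)]
```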

Listing and filtering datasets

from skimindex.datasets import datasets_for_role, get_dataset, all_datasets

# All datasets with a given role
for ds in datasets_for_role("decontamination"):
    print(ds.name, ds.source)

# A single named dataset
ds = get_dataset("human")

Roles

A role ([role.X]) defines how a group of datasets is processed. It:

- provides an output directory name (the role subdirectory in processed data)
- declares the default processing pipeline to run (run key)

Linking datasets to roles

Every dataset declares its role via the role key:

[data.human]
source = "ncbi"
role   = "decontamination"   # ← belongs to this role

[data.fungi]
source = "genbank"
role   = "decontamination"   # ← also belongs to this role

The role groups all datasets that share the same processing purpose.

Linking roles to processing

A role declares which processing pipeline to run by default:

[role.decontamination]
directory = "decontamination"
run       = "prepare_decontam"   # default pipeline for all datasets in this role

Individual datasets can override this with their own run key (see Configuration File Format).
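
The override order can be sketched as a two-level lookup: the dataset's own run key wins, otherwise the role's default applies. The example assumes the configuration is available as nested dictionaries; `resolve_run` and the `custom_fungi` pipeline name are hypothetical.

```python
def resolve_run(config, dataset_name):
    """Return the pipeline name for a dataset (hypothetical helper).

    A run key on the dataset takes precedence over the role default.
    """
    dataset = config["data"][dataset_name]
    role = config["role"][dataset["role"]]
    return dataset.get("run", role.get("run"))


config = {
    "role": {
        "decontamination": {"directory": "decontamination", "run": "prepare_decontam"},
    },
    "data": {
        "human": {"source": "ncbi", "role": "decontamination"},
        # Hypothetical per-dataset override, for illustration only.
        "fungi": {"source": "genbank", "role": "decontamination", "run": "custom_fungi"},
    },
}

human_run = resolve_run(config, "human")
fungi_run = resolve_run(config, "fungi")
```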

Linking processing to roles

A processing section can also declare which role it operates on via its own role key. This makes the section self-describing — the runner can discover which datasets to process without any hardcoding:

[processing.prepare_decontam]
role      = "decontamination"   # operates on all datasets with this role
directory = "parts"
steps     = [...]

See Processing Model for details on how named input parameters and artifact references work in processing sections.
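
The role-based discovery described in this section can be sketched as follows. It assumes the configuration is held as nested dictionaries; `datasets_for_processing` is a hypothetical helper (the real runner may work differently), and the `beech` dataset with its `genomes` role is invented for contrast.

```python
def datasets_for_processing(config, processing_name):
    """List dataset names whose role matches the processing section's role.

    Hypothetical helper: returns an empty list when the section declares
    no role key.
    """
    role = config["processing"][processing_name].get("role")
    if role is None:
        return []
    return sorted(
        name for name, dataset in config["data"].items()
        if dataset.get("role") == role
    )


config = {
    "processing": {
        "prepare_decontam": {"role": "decontamination", "directory": "parts"},
    },
    "data": {
        "human": {"source": "ncbi", "role": "decontamination"},
        "fungi": {"source": "genbank", "role": "decontamination"},
        # Hypothetical dataset with another role; it must not be selected.
        "beech": {"source": "internal", "role": "genomes"},
    },
}

selected = datasets_for_processing(config, "prepare_decontam")
```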

Data flow summary

[source.X]          [data.Y]           [role.Z]         [processing.W]
   │                   │                   │                   │
   │ raw files         │ source + role     │ directory         │ role + input
   │                   │ → Dataset         │ run = "W"         │ directory
   ▼                   ▼                   ▼                   ▼
download/          Dataset.to_data()  processed_data/    pipeline execution
{data}/              → Data(FILES)     {role}/             → output in
{species}/           → Data(STREAM)     {data}/              {role}/{data}/
{accession}/                             {species}/            {species}/
                                          {accession}/          {accession}/
                                           {processing}/

The subdir carried by each Data object encodes the path segments between the processed data root and the processing output directory, so that output paths are always computed from the data itself, never passed as external parameters.