skimindex.datasets

skimindex.datasets: enumeration of and access to [data.X] config blocks.

Each dataset binds a source (ncbi, genbank, sra, or internal) to a role (decontamination, genomes, genome_skims) and carries its download and processing parameters.

Usage

from skimindex.datasets import Dataset, datasets_for_role

# `pipeline` stands for the caller's own processing function; it is not part of this module.
for ds in datasets_for_role("decontamination"):
    for data in ds.to_data():
        pipeline(data, ds.output_dir, dry_run=False)

Dataset

Dataset(name: str, cfg: dict[str, Any])

A configured [data.X] block with typed access and Data conversion.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| name | str | Dataset name (the X in [data.X]). |
| source | str | Data origin: "ncbi", "genbank", "sra", or "internal". |
| role | str | Pipeline role: "decontamination", "genomes", etc. |

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| name | str | Dataset name as declared in the TOML file. | required |
| cfg | dict[str, Any] | Raw [data.X] dict from the parsed config. | required |

download_dir property

download_dir: Path

Directory where source files were downloaded.

output_dir property

output_dir: Path

Processing output directory for this dataset.

get

get(key: str, default: Any = None) -> Any

Return a raw config value for this dataset, without type conversion.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| key | str | Config key to look up (e.g. "taxon", "divisions"). | required |
| default | Any | Value to return if the key is absent. | None |
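The lookup semantics described above can be sketched as follows, assuming `get` behaves like `dict.get` on the raw `cfg` dict. `DatasetSketch` is a hypothetical stand-in written for this example, not the library's implementation.

```python
from typing import Any


class DatasetSketch:
    """Hypothetical stand-in for Dataset; it only models get()."""

    def __init__(self, name: str, cfg: dict[str, Any]) -> None:
        self.name = name
        self._cfg = cfg

    def get(self, key: str, default: Any = None) -> Any:
        # Assumption: the value is returned exactly as stored, with no type conversion.
        return self._cfg.get(key, default)


# Illustrative config values, modelled on the dataset_config("human") example.
ds = DatasetSketch("human", {"source": "ncbi", "role": "decontamination", "taxon": "human"})

taxon = ds.get("taxon")              # key present: the raw stored value
divisions = ds.get("divisions", [])  # key absent: the caller-supplied default
missing = ds.get("divisions")        # key absent, no default given: None
```

Passing an explicit default such as `[]` spares the caller a separate `None` check when the value is going to be iterated.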

to_index_data

to_index_data() -> Data

Return a single Data object representing the full dataset output directory.

Used by indexers that process all assemblies of a dataset at once.

Returns:

| Type | Description |
| --- | --- |
| Data | A DIRECTORY Data with subdir set to Path(self.directory) and path pointing to the dataset output directory. |

to_data

to_data() -> Iterator[Data]

Yield Data objects representing this dataset's input files.

Yields:

| Type | Description |
| --- | --- |
| Data | One Data per genome file for ncbi sources (FILES kind), one Data per GenBank division for genbank sources (STREAM kind), or one Data per R1/R2 pair for sra / internal sources (FILES kind with 1 or 2 paths). |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If source is not a supported value. |
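The source-to-kind mapping described for `to_data` can be sketched as a small generator. Everything here is an assumption made for illustration: `DataSketch`, its `kind` and `items` fields, the `entries` argument, and the file and division names are hypothetical and do not reflect the real `Data` class or how the library discovers its inputs.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class DataSketch:
    """Hypothetical stand-in for Data: a kind label plus the items it covers."""

    kind: str                # "FILES" or "STREAM", following the description above
    items: tuple[str, ...]   # file paths, or a single division name


def to_data_sketch(source: str, entries: list[tuple[str, ...]]) -> Iterator[DataSketch]:
    """Yield one DataSketch per entry, choosing the kind from the source."""
    if source == "ncbi":
        kind = "FILES"       # one genome file per item
    elif source == "genbank":
        kind = "STREAM"      # one GenBank division per item
    elif source in ("sra", "internal"):
        kind = "FILES"       # one R1/R2 pair (1 or 2 paths) per item
    else:
        raise ValueError(f"unsupported source: {source!r}")
    for entry in entries:
        yield DataSketch(kind, entry)


# Paired-end and single-end reads both map to FILES items.
pairs = list(to_data_sketch("sra", [("s1_R1.fastq.gz", "s1_R2.fastq.gz"), ("s2_R1.fastq.gz",)]))
divisions = list(to_data_sketch("genbank", [("bct",), ("pln",)]))

try:
    list(to_data_sketch("ftp", []))
except ValueError as exc:
    error = str(exc)
```

Because the sketch is a generator function, its `ValueError` is raised when iteration starts rather than when `to_data_sketch` is called. Whether the library behaves the same way is not stated in this reference.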

all_datasets

all_datasets() -> dict[str, dict[str, Any]]

Return all [data.X] sections keyed by dataset name.

datasets_for_source

datasets_for_source(source: str) -> list[str]

Return the names of all datasets whose source matches source.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| source | str | Source type to filter on ("ncbi", "genbank", "sra", …). | required |

Returns:

| Type | Description |
| --- | --- |
| list[str] | List of dataset names. |

Example
datasets_for_source("ncbi")    # ["human", "fungi"]
datasets_for_source("genbank") # ["bacteria"]
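One way to picture the filtering is as a scan over the parsed [data.X] tables. The sketch below is hypothetical: the `CONFIG` dict and its values are invented for illustration, and `datasets_for_source_sketch` is not the library function.

```python
from typing import Any

# Hypothetical parsed config: one entry per [data.X] table, keyed by X.
CONFIG: dict[str, dict[str, Any]] = {
    "human": {"source": "ncbi", "role": "decontamination", "taxon": "human"},
    "fungi": {"source": "ncbi", "role": "genomes"},
    "bacteria": {"source": "genbank", "role": "decontamination"},
}


def datasets_for_source_sketch(source: str) -> list[str]:
    # Keep the names whose "source" key equals the requested value,
    # in the order the datasets were declared.
    return [name for name, cfg in CONFIG.items() if cfg.get("source") == source]


ncbi_names = datasets_for_source_sketch("ncbi")
genbank_names = datasets_for_source_sketch("genbank")
sra_names = datasets_for_source_sketch("sra")   # no dataset uses this source here
```

In this sketch a source with no matching dataset yields an empty list rather than an error, so the result can be iterated without a guard.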

datasets_for_role

datasets_for_role(role: str) -> list[Dataset]

Return Dataset objects for all datasets assigned to role.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| role | str | Role name to filter on ("decontamination", "genomes", …). | required |

Returns:

| Type | Description |
| --- | --- |
| list[Dataset] | List of Dataset instances. |

Example
datasets_for_role("decontamination")
# [Dataset('human', ...), Dataset('bacteria', ...)]

dataset_config

dataset_config(name: str) -> dict[str, Any]

Return the config dict for a single dataset (empty dict if not found).

Example

dataset_config("human") → {"source": "ncbi", "role": "decontamination", "taxon": "human", "example": True}

get_dataset

get_dataset(name: str) -> Dataset

Return a Dataset object for a named dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| name | str | Dataset name as declared in [data.X]. | required |

Returns:

| Type | Description |
| --- | --- |
| Dataset | The corresponding Dataset instance. |

Raises:

| Type | Description |
| --- | --- |
| KeyError | If name is not found in the config. |
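The two lookup functions treat an unknown name differently: `dataset_config` returns an empty dict, while `get_dataset` raises `KeyError`. The sketch below contrasts the two behaviours using hypothetical stand-ins over an invented `CONFIG` dict; a plain tuple takes the place of a real `Dataset`.

```python
from typing import Any

# Hypothetical parsed config holding a single [data.human] block.
CONFIG: dict[str, dict[str, Any]] = {
    "human": {"source": "ncbi", "role": "decontamination", "taxon": "human"},
}


def dataset_config_sketch(name: str) -> dict[str, Any]:
    # Unknown names fall back to an empty dict instead of raising.
    return dict(CONFIG.get(name, {}))


def get_dataset_sketch(name: str) -> tuple[str, dict[str, Any]]:
    # Unknown names raise KeyError; the tuple stands in for Dataset(name, cfg).
    if name not in CONFIG:
        raise KeyError(name)
    return name, CONFIG[name]


known = get_dataset_sketch("human")
empty = dataset_config_sketch("mouse")   # {} for an undeclared dataset

try:
    get_dataset_sketch("mouse")
    lookup_failed = False
except KeyError:
    lookup_failed = True
```

The lenient form suits optional lookups where a missing dataset is acceptable. The strict form surfaces a misspelled dataset name immediately.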