skimindex.datasets
skimindex.datasets — enumeration and access of [data.X] config blocks.
Each dataset binds a source (ncbi, genbank, internal) to a role (decontamination, genomes, genome_skims) with download and processing parameters.
Usage
from skimindex.datasets import Dataset, datasets_for_role

for ds in datasets_for_role("decontamination"):
    for data in ds.to_data():
        pipeline(data, ds.output_dir, dry_run=False)
Dataset
Dataset(name: str, cfg: dict[str, Any])
A configured [data.X] block with typed access and Data conversion.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | Dataset name (the … |
| source | str | Data origin: … |
| role | str | Pipeline role: … |
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Dataset name as declared in the TOML file. | required |
| cfg | dict[str, Any] | Raw … | required |
download_dir
property
download_dir: Path
Directory where source files were downloaded.
output_dir
property
output_dir: Path
Processing output directory for this dataset.
get
get(key: str, default: Any = None) -> Any
Return a raw config value for this dataset, without type conversion.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| key | str | Config key to look up (e.g. … | required |
| default | Any | Value to return if the key is absent. | None |
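A minimal sketch of the lookup behaviour described above, using a hypothetical stand-in class (DatasetSketch) rather than the real Dataset. It assumes get simply delegates to the raw config dict, so values come back exactly as they were parsed.

```python
from typing import Any


class DatasetSketch:
    """Hypothetical stand-in for Dataset; only get() is modelled."""

    def __init__(self, name: str, cfg: dict[str, Any]) -> None:
        self.name = name
        self._cfg = cfg

    def get(self, key: str, default: Any = None) -> Any:
        # Assumption: a plain dict lookup with no type conversion.
        return self._cfg.get(key, default)


ds = DatasetSketch("human", {"source": "ncbi", "taxon": "human"})
taxon = ds.get("taxon")              # present key: raw value
missing = ds.get("no_such_key")      # absent key: falls back to None
fallback = ds.get("no_such_key", 0)  # absent key with an explicit default
```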
to_index_data
to_index_data() -> Data
Return a single Data object representing the full dataset output directory.
Used by indexers that process all assemblies of a dataset at once.
Returns:
| Type | Description |
|---|---|
| Data | A DIRECTORY Data … and … |
to_data
to_data() -> Iterator[Data]
Yield Data objects representing this dataset's input files.
Yields:
| Type | Description |
|---|---|
| Data | One Data … one Data … kind), or one Data … sources (… |
Raises:
| Type | Description |
|---|---|
| ValueError | If … |
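Because to_data() returns an iterator, an error raised inside it would surface while iterating rather than at the call itself, provided it is written as a generator (an assumption; the source does not say). The sketch below uses a hypothetical generator, fake_to_data, with an invented failure condition, purely to show where the try block belongs.

```python
from collections.abc import Iterator


def fake_to_data(source: str) -> Iterator[str]:
    """Hypothetical stand-in for Dataset.to_data(); yields strings, not Data."""
    if source not in {"ncbi", "genbank", "internal"}:
        # Invented condition, for illustration only.
        raise ValueError(f"unsupported source: {source!r}")
    yield f"{source}/item-1"
    yield f"{source}/item-2"


def collect(source: str) -> list[str]:
    items = fake_to_data(source)  # nothing runs yet: generators are lazy
    try:
        return list(items)        # the generator body executes here
    except ValueError:
        return []


ok = collect("ncbi")
bad = collect("unknown")
```

Wrapping only the call in try/except would miss the error in this sketch, since the generator body does not run until the first item is requested.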
all_datasets
all_datasets() -> dict[str, dict[str, Any]]
Return all [data.X] sections keyed by dataset name.
datasets_for_source
datasets_for_source(source: str) -> list[str]
Return the names of all datasets whose source matches source.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | str | Source type to filter on (… | required |
Returns:
| Type | Description |
|---|---|
| list[str] | List of dataset names. |
Example
datasets_for_source("ncbi") # ["human", "fungi"]
datasets_for_source("genbank") # ["bacteria"]
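One plausible way to picture this function is as a filter over the mapping returned by all_datasets(). The sketch below is not the library's implementation: names_for_source and the sample mapping are hypothetical, with dataset names borrowed from the examples on this page and the fungi role assumed.

```python
from typing import Any

# Hypothetical mapping in the shape all_datasets() is documented to return.
SAMPLE: dict[str, dict[str, Any]] = {
    "human": {"source": "ncbi", "role": "decontamination"},
    "fungi": {"source": "ncbi", "role": "genomes"},  # role is an assumption
    "bacteria": {"source": "genbank", "role": "decontamination"},
}


def names_for_source(datasets: dict[str, dict[str, Any]], source: str) -> list[str]:
    """Return dataset names whose "source" key equals source, in mapping order."""
    return [name for name, cfg in datasets.items() if cfg.get("source") == source]


ncbi_names = names_for_source(SAMPLE, "ncbi")
genbank_names = names_for_source(SAMPLE, "genbank")
```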
datasets_for_role
datasets_for_role(role: str) -> list[Dataset]
Return Dataset objects for all datasets assigned to role.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| role | str | Role name to filter on (… | required |
Returns:
| Type | Description |
|---|---|
| list[Dataset] | List of … |
Example
datasets_for_role("decontamination")
# [Dataset('human', ...), Dataset('bacteria', ...)]
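For an overview of which datasets serve which role, the raw mapping can be grouped by its role key. This is a sketch over a hypothetical sample mapping, not a call into skimindex; group_by_role, the "unassigned" label and the fungi role are all assumptions.

```python
from collections import defaultdict
from typing import Any

# Hypothetical mapping in the shape all_datasets() is documented to return.
SAMPLE: dict[str, dict[str, Any]] = {
    "human": {"source": "ncbi", "role": "decontamination"},
    "fungi": {"source": "ncbi", "role": "genomes"},  # role is an assumption
    "bacteria": {"source": "genbank", "role": "decontamination"},
}


def group_by_role(datasets: dict[str, dict[str, Any]]) -> dict[str, list[str]]:
    """Map each role to the dataset names that declare it."""
    groups: defaultdict[str, list[str]] = defaultdict(list)
    for name, cfg in datasets.items():
        # Datasets without a role key land under a made-up "unassigned" label.
        groups[cfg.get("role", "unassigned")].append(name)
    return dict(groups)


by_role = group_by_role(SAMPLE)
```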
dataset_config
dataset_config(name: str) -> dict[str, Any]
Return the config dict for a single dataset (empty dict if not found).
Example
dataset_config("human")
# {"source": "ncbi", "role": "decontamination", "taxon": "human", "example": True}
get_dataset
get_dataset(name: str) -> Dataset
Return a Dataset object for a named dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Dataset name as declared in … | required |
Returns:
| Type | Description |
|---|---|
| Dataset | The corresponding … |
Raises:
| Type | Description |
|---|---|
| KeyError | If name is not found in the config. |
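The two lookups on this page treat an unknown name differently: dataset_config returns an empty dict, while get_dataset raises KeyError. The sketch below mimics both behaviours with hypothetical stand-ins (config_for, dataset_for) over a sample mapping; the strict variant returns a tuple rather than a real Dataset.

```python
from typing import Any

# Hypothetical sample config; values follow the dataset_config example.
SAMPLE: dict[str, dict[str, Any]] = {
    "human": {"source": "ncbi", "role": "decontamination", "taxon": "human"},
}


def config_for(name: str) -> dict[str, Any]:
    """Lenient lookup: empty dict when the name is unknown."""
    return SAMPLE.get(name, {})


def dataset_for(name: str) -> tuple[str, dict[str, Any]]:
    """Strict lookup: KeyError when the name is unknown."""
    if name not in SAMPLE:
        raise KeyError(name)
    return name, SAMPLE[name]


lenient = config_for("nope")  # no exception, just an empty dict

try:
    dataset_for("nope")
    strict_failed = False
except KeyError:
    strict_failed = True
```

The lenient form suits optional lookups, while the strict form makes a misspelled dataset name fail early.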