skimindex.sources

skimindex.sources — source registry and directory helpers.

A source ([source.X] in TOML) is an external data provider with a configured directory under SKIMINDEX_ROOT.

Usage

from skimindex.sources import (
    source_dir, dataset_download_dir,
    output_dir, dataset_output_dir,
)

genbank_root  = source_dir("genbank")
human_dl_dir  = dataset_download_dir("human")
decontam_root = output_dir("role", "decontamination")
human_out_dir = dataset_output_dir("human")

source_dir

source_dir(source: str) -> Path

Root directory for a named source.

Reads [source.X].directory and resolves it against SKIMINDEX_ROOT.

Example

source_dir("genbank") → Path("/data/genbank")
source_dir("ncbi")    → Path("/data/genbank")  # shares dir with genbank
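
The resolution rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the skimindex implementation: the `CONFIG` dict, the `/data` root, and the `sketch_source_dir` helper are hypothetical stand-ins for the parsed TOML configuration and for SKIMINDEX_ROOT.

```python
from pathlib import Path

# Hypothetical stand-ins: the real module reads these from the TOML
# configuration and from SKIMINDEX_ROOT.
SKIMINDEX_ROOT = Path("/data")
CONFIG = {
    "source": {
        "genbank": {"directory": "genbank"},
        "ncbi": {"directory": "genbank"},  # shares a directory with genbank
    }
}


def sketch_source_dir(source: str) -> Path:
    """Join the configured directory of a source onto the root."""
    directory = CONFIG["source"][source]["directory"]
    return SKIMINDEX_ROOT / directory


print(sketch_source_dir("genbank"))
print(sketch_source_dir("ncbi"))
```

Because two sources may point at the same `directory` value, callers should not assume that distinct source names map to distinct paths.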

dataset_download_dir

dataset_download_dir(dataset_name: str) -> Path

Download output directory for a named dataset.

Resolves to source_dir(dataset.source) / dataset.directory, where dataset.directory defaults to dataset_name if not set.

Example

dataset_download_dir("human")       → Path("/data/genbank/human")
dataset_download_dir("betula_nana") → Path("/data/raw_data/Betula_nana")
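
A minimal sketch of this rule, including the fallback to the dataset name, might look like the following. The `SOURCE_DIRS` and `DATASETS` tables, the source name `local`, and the helper name are hypothetical; only the resolution logic mirrors the description above.

```python
from pathlib import Path

# Hypothetical, already-resolved source directories.
SOURCE_DIRS = {
    "genbank": Path("/data/genbank"),
    "local": Path("/data/raw_data"),  # assumed source name for illustration
}

# Hypothetical dataset sections as they might appear after parsing the config.
DATASETS = {
    "human": {"source": "genbank"},  # no directory set: falls back to the name
    "betula_nana": {"source": "local", "directory": "Betula_nana"},
}


def sketch_dataset_download_dir(dataset_name: str) -> Path:
    """source directory / dataset directory, defaulting to the dataset name."""
    dataset = DATASETS[dataset_name]
    directory = dataset.get("directory", dataset_name)
    return SOURCE_DIRS[dataset["source"]] / directory


print(sketch_dataset_download_dir("human"))
print(sketch_dataset_download_dir("betula_nana"))
```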

output_dir

output_dir(section_kind: str, section_name: str) -> Path

Processing output directory for a named config section.

Reads the section's 'directory' key and resolves it under the appropriate root for the section kind:

- "role" → processed_data_dir() / section.directory
- "index" → indexes_dir() / section.directory

Parameters:

Name          Type  Description                                      Default
section_kind  str   "role" or "index"                                required
section_name  str   Name of the sub-section, e.g. "decontamination"  required

Examples:

output_dir("role", "decontamination") → /processed_data/decontamination
output_dir("role", "genomes")         → /processed_data/genomes_15x
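
The dispatch on section kind can be illustrated with a small sketch. The `ROOTS` and `CONFIG` tables are hypothetical stand-ins for processed_data_dir(), indexes_dir(), and the parsed configuration. Raising `ValueError` for an unknown kind is an assumption made for this example; the documentation does not say how the real function handles that case.

```python
from pathlib import Path

# Hypothetical roots standing in for processed_data_dir() and indexes_dir().
ROOTS = {"role": Path("/processed_data"), "index": Path("/indexes")}

# Hypothetical parsed config: section kind -> section name -> settings.
CONFIG = {
    "role": {
        "decontamination": {"directory": "decontamination"},
        "genomes": {"directory": "genomes_15x"},
    },
    "index": {},
}


def sketch_output_dir(section_kind: str, section_name: str) -> Path:
    """Pick the root for the section kind, then append the section directory."""
    if section_kind not in ROOTS:
        # Assumption: the behaviour for other kinds is not documented.
        raise ValueError(f"unknown section kind: {section_kind!r}")
    directory = CONFIG[section_kind][section_name]["directory"]
    return ROOTS[section_kind] / directory


print(sketch_output_dir("role", "decontamination"))
print(sketch_output_dir("role", "genomes"))
```

Note that the section name and its directory can differ, as with "genomes" and "genomes_15x".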

dataset_output_dir

dataset_output_dir(dataset_name: str) -> Path

Processing output directory for a dataset, resolved under its role.

Resolves: output_dir("role", dataset.role) / dataset.directory

Example

dataset_output_dir("human")  → /processed_data/decontamination/human
dataset_output_dir("plants") → /processed_data/decontamination/Plants
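
As a rough sketch, the composition of a role directory with a dataset directory could be written as below. The `ROLE_DIRS` and `DATASETS` tables and the helper name are hypothetical; in the real module the role directory would come from output_dir("role", dataset.role).

```python
from pathlib import Path

# Hypothetical, already-resolved role output directories.
ROLE_DIRS = {"decontamination": Path("/processed_data/decontamination")}

# Hypothetical dataset sections, each naming its role and directory.
DATASETS = {
    "human": {"role": "decontamination", "directory": "human"},
    "plants": {"role": "decontamination", "directory": "Plants"},
}


def sketch_dataset_output_dir(dataset_name: str) -> Path:
    """Role output directory / dataset directory."""
    dataset = DATASETS[dataset_name]
    return ROLE_DIRS[dataset["role"]] / dataset["directory"]


print(sketch_dataset_output_dir("human"))
print(sketch_dataset_output_dir("plants"))
```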

resolve_artifact

resolve_artifact(
    value: str | dict, dataset_subdir: Path | None = None
) -> Path

Resolve an artifact reference to an absolute path.

Accepted forms

"parts@decontamination"                      → processed_data/{role_dir}/{dataset_subdir}/parts
"parts@idx:decontamination"                  → indexes/{role_dir}/{dataset_subdir}/parts
"@idx:decontamination"                       → indexes/{role_dir}/ (meta-index, no subdir)
{"role": "decontamination", "dir": "parts"}  → same as string form
{"role": "idx:decontamination", "dir": ""}   → meta-index

Parameters:

Name            Type         Description                                                                                                                                                Default
value           str | dict   Artifact reference: string ("dir@[idx:]role") or dict ({"role": "…", "dir": "…"}).                                                                         required
dataset_subdir  Path | None  Relative subdir within the role tree (e.g. Path("Human/Homo_sapiens--GCF_…")). Pass None for meta-index references where no per-dataset level is needed.  None

Returns:

Type  Description
Path  Absolute Path to the artifact directory.

Raises:

Type        Description
ValueError  If value is a string without an @ separator, or if role_name cannot be resolved (falls back to role name as directory).
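
To make the accepted forms concrete, here is a minimal sketch of how such a reference could be parsed and resolved. It is not the skimindex implementation: the roots, the `ROLE_DIRS` table, and the helper name are hypothetical. Falling back to the role name when no role directory is configured is one reading of the note above and should be treated as an assumption.

```python
from pathlib import Path
from typing import Optional, Union

# Hypothetical roots and role-directory table for illustration only.
PROCESSED_ROOT = Path("/processed_data")
INDEXES_ROOT = Path("/indexes")
ROLE_DIRS = {"decontamination": "decontamination"}


def sketch_resolve_artifact(
    value: Union[str, dict], dataset_subdir: Optional[Path] = None
) -> Path:
    """Resolve "dir@[idx:]role" or {"role": ..., "dir": ...} to a path."""
    if isinstance(value, dict):
        role, directory = value["role"], value.get("dir", "")
    else:
        if "@" not in value:
            raise ValueError(f"artifact reference lacks '@': {value!r}")
        directory, role = value.split("@", 1)

    # An "idx:" prefix on the role selects the indexes tree.
    is_index = role.startswith("idx:")
    if is_index:
        role = role[len("idx:"):]
    root = INDEXES_ROOT if is_index else PROCESSED_ROOT

    # Assumption: fall back to the role name when no directory is configured.
    path = root / ROLE_DIRS.get(role, role)
    if dataset_subdir is not None:
        path = path / dataset_subdir
    if directory:
        path = path / directory
    return path


print(sketch_resolve_artifact("parts@decontamination", Path("Human")))
print(sketch_resolve_artifact("@idx:decontamination"))
```

Splitting on the first `@` only keeps any later `@` characters inside the role part, which is a design choice of this sketch rather than documented behaviour.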