skimindex.naming

skimindex.naming — canonical naming rules for genome files and directories.

The -- sequence is the reserved separator between a taxon name and an accession in level-0 filenames. All business logic for converting between the (species, accession, ext, compressed) tuple and filesystem paths lives here.

Species-organised layouts (by_species=true)

Level 0 — flat file, single accession per species:
{Species_name}--{accession}.{ext}[.gz]
processed_data subdir: {Species_name}/{accession}

No-accession layout (accession unknown, any source):
processed_data subdir: {Species_name}/default

Level 1 — species subdirectory, one file per accession:
{Species_name}/{accession}.{ext}[.gz]
processed_data subdir: {Species_name}/{accession}

Level 2 — species+accession subdirectories, multiple files per accession:
{Species_name}/{accession}/*.{ext}[.gz]
processed_data subdir: {Species_name}/{accession}

Non-species-organised layout (by_species=false)

Release_{N}/fasta/{division}/gb{div}{N}.{ext}[.gz]
processed_data subdir: {division}

Public API

parse_genome_path(path) → (species, accession, ext, compressed)
Accepts level-0, level-1, and level-2 paths (relative to data.directory). For level-0, the accession is the part of the stem after '--'.

genome_subdir(species, accession) → Path
Builds the canonical processed_data relative path: Species_name/accession. When no accession is available, pass accession="default".

output_subdir_for(path) → Path
Convenience: parse_genome_path + genome_subdir in one call. Returns the processed_data relative subdir for a path at any level.

parse_division_path(path) → (division, filename, ext, compressed)
Parses a non-species-organised GenBank path to extract the division name.

genome_filename(species, accession, ext, compressed) → str
Builds the canonical level-0 filename.

canonical_species(name) → str
Normalises a raw species/taxon name to the canonical underscore form.

scan_species_dir(directory) → Iterator[tuple[Path, Path]]
Scans a species-organised directory and yields (absolute_file, subdir) pairs. subdir is the species/accession relative path, derived from the directory structure for level-1 and level-2, and from the filename for level-0.

canonical_species

canonical_species(name: str) -> str

Normalise a raw taxon/species name to the canonical underscore form.

Rules (applied in order):

1. Spaces → underscores.
2. Characters that are not alphanumeric, '_', '-', or '.' are removed (this includes parentheses, quotes, slashes, brackets, etc.).
3. Leading/trailing underscores or hyphens are stripped.

Examples:

canonical_species("Homo sapiens") → "Homo_sapiens"
canonical_species("Brassica rapa subsp. chinensis") → "Brassica_rapa_subsp._chinensis"
canonical_species("Mentha × piperita") → "Mentha_piperita"
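The rules above can be sketched as follows. This is an illustrative re-implementation, not the library's code; the name `canonical_species_sketch` is hypothetical. Two points are assumptions: "alphanumeric" is read as ASCII letters and digits, and runs of underscores are collapsed. The second assumption is not among the listed rules, but without it, removing "×" from "Mentha_×_piperita" would leave a double underscore rather than the documented "Mentha_piperita".

```python
import re


def canonical_species_sketch(name: str) -> str:
    """Illustrative normaliser following the three documented rules."""
    # Rule 1: spaces become underscores.
    out = name.replace(" ", "_")
    # Rule 2: drop everything except ASCII letters, digits, '_', '-', '.'.
    # (Assumption: "alphanumeric" means ASCII alphanumeric.)
    out = re.sub(r"[^A-Za-z0-9_.-]", "", out)
    # Assumption, not a documented rule: collapse underscore runs so the
    # "Mentha × piperita" example yields a single underscore.
    out = re.sub(r"_+", "_", out)
    # Rule 3: strip leading/trailing underscores or hyphens.
    return out.strip("_-")


print(canonical_species_sketch("Mentha × piperita"))
```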

genome_filename

genome_filename(
    species: str,
    accession: str,
    ext: str,
    compressed: bool = True,
) -> str

Build the canonical level-0 filename.

Example

genome_filename("Homo_sapiens", "GCF_000001405.40", "gbff") → "Homo_sapiens--GCF_000001405.40.gbff.gz"
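A minimal sketch of how such a filename could be assembled from the level-0 pattern {Species_name}--{accession}.{ext}[.gz]. The function name `genome_filename_sketch` is hypothetical, and the sketch assumes the species name has already been normalised.

```python
def genome_filename_sketch(
    species: str,
    accession: str,
    ext: str,
    compressed: bool = True,
) -> str:
    """Join the parts with the reserved '--' separator (illustrative only)."""
    name = f"{species}--{accession}.{ext}"
    # Append the gzip suffix only when the file is compressed.
    return (name + ".gz") if compressed else name


print(genome_filename_sketch("Homo_sapiens", "GCF_000001405.40", "gbff"))
```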

genome_subdir

genome_subdir(species: str, accession: str) -> Path

Build the canonical processed_data relative path: Species_name/accession.

When no accession is available, pass accession="default" as a conventional placeholder for a single individual of unknown or untracked accession.

Examples:

genome_subdir("Homo_sapiens", "GCF_000001405.40") → Path("Homo_sapiens/GCF_000001405.40")
genome_subdir("Betula_nana", "default") → Path("Betula_nana/default")
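The "default" placeholder convention could be wrapped as in the sketch below. `genome_subdir_sketch` is a hypothetical helper: unlike the documented signature, it accepts a missing accession and substitutes the placeholder itself.

```python
from pathlib import Path
from typing import Optional


def genome_subdir_sketch(species: str, accession: Optional[str] = None) -> Path:
    """Build Species_name/accession, falling back to 'default' (illustrative)."""
    # Hypothetical convenience: None or "" maps to the "default" placeholder.
    return Path(species) / (accession or "default")


print(genome_subdir_sketch("Betula_nana"))
```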

parse_genome_path

parse_genome_path(
    path: Path | str,
) -> tuple[str, str, str, bool]

Parse a species-organised genome path into (species, accession, ext, compressed).

The path must be relative to {data.directory}. Three levels are recognised:

Level 0 — flat filename with '--' separator:
Homo_sapiens--GCF_000001405.40.gbff.gz
→ ("Homo_sapiens", "GCF_000001405.40", "gbff", True)

Level 1 — species subdirectory, one file per accession:
Homo_sapiens/GCF_000001405.40.gbff.gz
→ ("Homo_sapiens", "GCF_000001405.40", "gbff", True)

Level 2 — species+accession subdirectories, multiple files:
Homo_sapiens/GCF_000001405.40/sequence.gbff.gz
→ ("Homo_sapiens", "GCF_000001405.40", "gbff", True)

Raises:

Type	Description
`ValueError`	if the path cannot be parsed or has an unrecognised extension.
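The three-level dispatch can be sketched by counting path components. This is an illustration under stated assumptions, not the library's implementation: the names `parse_genome_path_sketch` and `KNOWN_EXTS` are hypothetical, and the extension set is a guess because the recognised extensions are not listed here.

```python
from pathlib import Path
from typing import Tuple, Union

# Assumption: the real set of recognised extensions is not documented here.
KNOWN_EXTS = {"gbff", "gbk", "fasta", "fa", "fna"}


def _split_ext(filename: str) -> Tuple[str, str, bool]:
    """Return (stem, ext, compressed) or raise ValueError."""
    compressed = filename.endswith(".gz")
    base = filename[:-3] if compressed else filename
    stem, dot, ext = base.rpartition(".")
    if not dot or not stem or ext not in KNOWN_EXTS:
        raise ValueError(f"unrecognised extension: {filename!r}")
    return stem, ext, compressed


def parse_genome_path_sketch(path: Union[Path, str]) -> Tuple[str, str, str, bool]:
    parts = Path(path).parts
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"cannot parse path: {str(path)!r}")
    stem, ext, compressed = _split_ext(parts[-1])
    if len(parts) == 1:
        # Level 0: species and accession are both in the filename.
        species, sep, accession = stem.partition("--")
        if not (sep and species and accession):
            raise ValueError(f"missing '--' separator: {str(path)!r}")
    elif len(parts) == 2:
        # Level 1: the directory is the species, the stem is the accession.
        species, accession = parts[0], stem
    else:
        # Level 2: both come from directories; the filename stem is ignored.
        species, accession = parts[0], parts[1]
    return species, accession, ext, compressed


print(parse_genome_path_sketch("Homo_sapiens/GCF_000001405.40/sequence.gbff.gz"))
```

Note that the accession itself contains a dot (".40"), which is why the sketch strips ".gz" first and then splits on the last remaining dot only.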

parse_division_path

parse_division_path(
    path: Path | str,
) -> tuple[str, str, str, bool]

Parse a non-species-organised GenBank path into (division, filename, ext, compressed).

Expected layout (path relative to source-directory):
Release_{N}/fasta/{division}/gb{div}{N}.{ext}[.gz]

Returns:

Type	Description
`str`	division — e.g. "bct", "pln"
`str`	filename — e.g. "gbpln1.fasta.gz"
`str`	ext — base extension without dot, e.g. "fasta"
`bool`	compressed — True if ".gz"

Example

parse_division_path(Path("Release_270/fasta/bct/gbbct1.fasta.gz")) → ("bct", "gbbct1.fasta.gz", "fasta", True)

Raises:

Type	Description
`ValueError`	if the path structure is not recognised.
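A sketch of how the division layout might be recognised. The name `parse_division_path_sketch` and the regular expression are assumptions for illustration; in particular, the sketch does not check that the {div} fragment in the filename equals the {division} directory, since that constraint is not stated here.

```python
import re
from pathlib import Path
from typing import Tuple, Union

# Assumed filename shape: gb{div}{N}.{ext}[.gz], e.g. "gbbct1.fasta.gz".
_FILE_RE = re.compile(r"gb[a-z]+\d+\.(?P<ext>[A-Za-z0-9]+)(?P<gz>\.gz)?")


def parse_division_path_sketch(path: Union[Path, str]) -> Tuple[str, str, str, bool]:
    parts = Path(path).parts
    # Expect exactly Release_{N} / fasta / {division} / filename.
    if len(parts) != 4 or not parts[0].startswith("Release_") or parts[1] != "fasta":
        raise ValueError(f"path structure not recognised: {str(path)!r}")
    division, filename = parts[2], parts[3]
    match = _FILE_RE.fullmatch(filename)
    if match is None:
        raise ValueError(f"filename not recognised: {filename!r}")
    return division, filename, match.group("ext"), match.group("gz") is not None


print(parse_division_path_sketch("Release_270/fasta/bct/gbbct1.fasta.gz"))
```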

output_subdir_for

output_subdir_for(path: Path | str) -> Path

Return the processed_data relative subdir for a species-organised source path.

Combines parse_genome_path and genome_subdir. The accession is always extracted from the path, including for level-0 flat files.

Examples:

output_subdir_for(Path("Homo_sapiens--GCF_000001405.40.gbff.gz")) → Path("Homo_sapiens/GCF_000001405.40")
output_subdir_for(Path("Homo_sapiens/GCF_000001405.40.gbff.gz")) → Path("Homo_sapiens/GCF_000001405.40")
output_subdir_for(Path("Homo_sapiens/GCF_000001405.40/seq.gbff.gz")) → Path("Homo_sapiens/GCF_000001405.40")

scan_species_dir

scan_species_dir(
    directory: Path | str,
) -> Iterator[tuple[Path, Path]]

Scan a species-organised directory and yield (absolute_file, subdir) pairs.

subdir is the species/accession path (e.g. Homo_sapiens/GCF_000001405.40), derived from directory structure for level-1 and level-2, and from the filename (using the '--' separator) for level-0.

Three layouts are recognised (relative to directory):

Level 0 — flat file: {Species}--{accession}.ext → subdir = Species/accession

Level 1 — species subdirectory: {Species}/{accession}.ext → subdir = Species/accession

Level 2 — species+accession subdirectories: {Species}/{accession}/*.ext → subdir = Species/accession

Files that cannot be parsed (unknown extension, missing separator at level-0) are silently skipped.

Yields:

Type	Description
`tuple[Path, Path]`	(absolute_file, subdir) where absolute_file is an absolute Path and subdir is a relative Path.
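The scan described above could look roughly like the sketch below, which walks the tree and skips anything it cannot classify. It is an illustration only: `scan_species_dir_sketch` and `KNOWN_EXTS` are hypothetical names, the extension set is an assumption, and the real function's traversal order and handling of deeper nesting are not documented here.

```python
from pathlib import Path
from typing import Iterator, Optional, Tuple, Union

# Assumption: the real set of recognised extensions is not documented here.
KNOWN_EXTS = {"gbff", "gbk", "fasta", "fa", "fna"}


def _stem_if_known(filename: str) -> Optional[str]:
    """Return the name without .ext[.gz], or None for unknown extensions."""
    base = filename[:-3] if filename.endswith(".gz") else filename
    stem, dot, ext = base.rpartition(".")
    return stem if dot and stem and ext in KNOWN_EXTS else None


def scan_species_dir_sketch(directory: Union[Path, str]) -> Iterator[Tuple[Path, Path]]:
    root = Path(directory).resolve()
    for file in sorted(p for p in root.rglob("*") if p.is_file()):
        parts = file.relative_to(root).parts
        stem = _stem_if_known(parts[-1])
        if stem is None or len(parts) > 3:
            continue  # unknown extension or nesting deeper than level 2
        if len(parts) == 1:
            # Level 0: split the stem on the reserved '--' separator.
            species, sep, accession = stem.partition("--")
            if not (sep and species and accession):
                continue  # missing separator: silently skipped
        elif len(parts) == 2:
            species, accession = parts[0], stem  # Level 1
        else:
            species, accession = parts[0], parts[1]  # Level 2
        yield file, Path(species) / accession
```

A caller would typically join each yielded subdir onto its processed_data root to decide where outputs for that file belong.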