skimindex.naming
skimindex.naming — canonical naming rules for genome files and directories.
The -- sequence is the reserved separator between a taxon name and an accession
in level-0 filenames. All business logic for converting between the
(species, accession, ext, compressed) tuple and filesystem paths lives here.
Species-organised layouts (by_species=true)
Level 0 (flat file, single accession per species):
    {Species_name}--{accession}.{ext}[.gz]
    processed_data subdir: {Species_name}/{accession}
No-accession layout (accession unknown, any source):
    processed_data subdir: {Species_name}/default
Level 1 (species subdirectory, one file per accession):
    {Species_name}/{accession}.{ext}[.gz]
    processed_data subdir: {Species_name}/{accession}
Level 2 (species+accession subdirectories, multiple files per accession):
    {Species_name}/{accession}/*.{ext}[.gz]
    processed_data subdir: {Species_name}/{accession}
Non-species-organised layout (by_species=false)
Release_{N}/fasta/{division}/gb{div}{N}.{ext}[.gz]
processed_data subdir: {division}
Public API
parse_genome_path(path) → (species, accession, ext, compressed)
    Accepts level-0, level-1, and level-2 paths (relative to data.directory). For level-0, the accession is the part of the stem after '--'.
genome_subdir(species, accession) → Path
    Build the canonical processed_data relative path: Species_name/accession. When no accession is available, pass accession="default".
output_subdir_for(path) → Path
    Convenience: parse_genome_path + genome_subdir in one call. Returns the correct processed_data relative subdir for any level.
parse_division_path(path) → (division, filename, ext, compressed)
    Parse a non-species-organised GenBank path to extract the division name.
genome_filename(species, accession, ext, compressed) → str
    Build the canonical level-0 filename.
canonical_species(name) → str
    Normalise a raw species/taxon name to the canonical underscore form.
scan_species_dir(directory) → Iterator[tuple[Path, Path]]
    Scan a species-organised directory and yield (absolute_file, subdir) pairs. subdir is the species/accession relative path, derived from the directory structure for level-1/2 and from the filename for level-0.
canonical_species
canonical_species(name: str) -> str
Normalise a raw taxon/species name to the canonical underscore form.
Rules (applied in order):
1. Spaces → underscores.
2. Characters that are not alphanumeric, '_', '-', or '.' are removed. (This includes parentheses, quotes, slashes, brackets, etc.)
3. Leading/trailing underscores or hyphens are stripped.
Examples:
canonical_species("Homo sapiens") → "Homo_sapiens"
canonical_species("Brassica rapa subsp. chinensis") → "Brassica_rapa_subsp._chinensis"
canonical_species("Mentha × piperita") → "Mentha_piperita"
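The rules above can be sketched in a few lines. This is a minimal illustration, not the module's actual implementation: the name `canonical_species_sketch` is hypothetical, "alphanumeric" is assumed to mean ASCII letters and digits, and the underscore-collapsing step is an assumption inferred from the "Mentha × piperita" example (the listed rules alone would leave a double underscore there).

```python
import re


def canonical_species_sketch(name: str) -> str:
    """Illustrative version of the documented normalisation rules."""
    # Rule 1: spaces become underscores.
    out = name.replace(" ", "_")
    # Rule 2: drop every character that is not an ASCII letter, digit,
    # '_', '-' or '.' (ASCII-only is an assumption of this sketch).
    out = re.sub(r"[^A-Za-z0-9_.-]", "", out)
    # Assumption, not a documented rule: collapse runs of underscores so
    # that removing the hybrid sign leaves a single underscore.
    out = re.sub(r"_+", "_", out)
    # Rule 3: strip leading and trailing underscores or hyphens.
    return out.strip("_-")


print(canonical_species_sketch("Mentha × piperita"))
```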
genome_filename
genome_filename(
species: str,
accession: str,
ext: str,
compressed: bool = True,
) -> str
Build the canonical level-0 filename.
Example
genome_filename("Homo_sapiens", "GCF_000001405.40", "gbff") → "Homo_sapiens--GCF_000001405.40.gbff.gz"
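As a rough sketch of how such a filename could be assembled (the function name `genome_filename_sketch` is hypothetical and the body is an assumption based only on the level-0 pattern `{Species_name}--{accession}.{ext}[.gz]`):

```python
SEPARATOR = "--"  # reserved separator between taxon name and accession


def genome_filename_sketch(
    species: str,
    accession: str,
    ext: str,
    compressed: bool = True,
) -> str:
    """Join the parts following the level-0 naming pattern."""
    name = f"{species}{SEPARATOR}{accession}.{ext}"
    # Append the gzip suffix only when the file is compressed.
    return f"{name}.gz" if compressed else name


print(genome_filename_sketch("Homo_sapiens", "GCF_000001405.40", "gbff"))
```

Because the species part is expected to be in canonical underscore form, the first `--` in the filename marks the boundary between species and accession.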
genome_subdir
genome_subdir(species: str, accession: str) -> Path
Build the canonical processed_data relative path: Species_name/accession.
When no accession is available, pass accession="default" as a conventional placeholder for a single individual of unknown or untracked accession.
Examples:
genome_subdir("Homo_sapiens", "GCF_000001405.40") → Path("Homo_sapiens/GCF_000001405.40")
genome_subdir("Betula_nana", "default") → Path("Betula_nana/default")
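A minimal sketch of the same idea, showing how the relative subdir might be joined onto an output root. The name `genome_subdir_sketch` and the `processed_root` location are hypothetical; a temporary directory stands in for the real processed_data directory.

```python
import tempfile
from pathlib import Path


def genome_subdir_sketch(species: str, accession: str) -> Path:
    """Relative processed_data path: Species_name/accession."""
    return Path(species) / accession


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical output root; the real location comes from configuration.
    processed_root = Path(tmp) / "processed_data"
    out_dir = processed_root / genome_subdir_sketch("Betula_nana", "default")
    out_dir.mkdir(parents=True)
    created = out_dir.relative_to(processed_root).as_posix()

print(created)
```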
parse_genome_path
parse_genome_path(
path: Path | str,
) -> tuple[str, str, str, bool]
Parse a species-organised genome path into (species, accession, ext, compressed).
The path must be relative to {data.directory}. Three levels are recognised:
Level 0 (flat filename with '--' separator):
    Homo_sapiens--GCF_000001405.40.gbff.gz → ("Homo_sapiens", "GCF_000001405.40", "gbff", True)
Level 1 (species subdirectory, one file per accession):
    Homo_sapiens/GCF_000001405.40.gbff.gz → ("Homo_sapiens", "GCF_000001405.40", "gbff", True)
Level 2 (species+accession subdirectories, multiple files):
    Homo_sapiens/GCF_000001405.40/sequence.gbff.gz → ("Homo_sapiens", "GCF_000001405.40", "gbff", True)
Raises:
| Type | Description |
|---|---|
| ValueError | if the path cannot be parsed or has an unrecognised extension. |
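The three levels can be told apart by the number of path components. The following is a sketch under stated assumptions, not the module's code: `parse_genome_path_sketch`, `split_ext`, and the contents of `KNOWN_EXTS` are all hypothetical, since the set of recognised extensions is not listed here.

```python
from pathlib import Path

# Assumption: the real list of recognised extensions is not documented here.
KNOWN_EXTS = {"gbff", "fasta", "fa", "gb"}
SEPARATOR = "--"


def split_ext(filename):
    """Split 'stem.ext[.gz]' into (stem, ext, compressed)."""
    compressed = filename.endswith(".gz")
    base = filename[:-3] if compressed else filename
    stem, dot, ext = base.rpartition(".")
    if not dot or ext not in KNOWN_EXTS:
        raise ValueError(f"unrecognised extension: {filename!r}")
    return stem, ext, compressed


def parse_genome_path_sketch(path):
    """Return (species, accession, ext, compressed) for a relative path."""
    parts = Path(path).parts
    if len(parts) == 1:  # level 0: Species--accession.ext[.gz]
        stem, ext, compressed = split_ext(parts[0])
        species, sep, accession = stem.partition(SEPARATOR)
        if not (sep and species and accession):
            raise ValueError(f"missing '--' separator: {parts[0]!r}")
        return species, accession, ext, compressed
    if len(parts) == 2:  # level 1: Species/accession.ext[.gz]
        accession, ext, compressed = split_ext(parts[1])
        return parts[0], accession, ext, compressed
    if len(parts) == 3:  # level 2: Species/accession/anything.ext[.gz]
        _, ext, compressed = split_ext(parts[2])
        return parts[0], parts[1], ext, compressed
    raise ValueError(f"cannot parse genome path: {path!s}")
```

Using `rpartition(".")` after removing the `.gz` suffix keeps dotted accessions such as `GCF_000001405.40` intact, because only the last dot is treated as the extension boundary.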
parse_division_path
parse_division_path(
path: Path | str,
) -> tuple[str, str, str, bool]
Parse a non-species-organised GenBank path into (division, filename, ext, compressed).
Expected layout (path relative to source-directory):
    Release_{N}/fasta/{division}/gb{div}{N}.{ext}[.gz]
Returns:
| Type | Description |
|---|---|
| str | division, e.g. "bct", "pln" |
| str | filename, e.g. "gbpln1.fasta.gz" |
| str | ext: base extension without dot, e.g. "fasta" |
| bool | compressed: True if ".gz" |
Example
parse_division_path(Path("Release_270/fasta/bct/gbbct1.fasta.gz")) → ("bct", "gbbct1.fasta.gz", "fasta", True)
Raises:
| Type | Description |
|---|---|
| ValueError | if the path structure is not recognised. |
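One way to implement this kind of structural check is to validate the path components one by one. This is only a sketch: `parse_division_path_sketch` is a hypothetical name, and the requirement that the file stem starts with "gb" plus the division name is an assumption read off the `gb{div}{N}` pattern.

```python
from pathlib import Path


def parse_division_path_sketch(path):
    """Return (division, filename, ext, compressed) or raise ValueError."""
    parts = Path(path).parts
    # Expect exactly four components: Release_{N} / fasta / {division} / file.
    if (
        len(parts) != 4
        or not parts[0].startswith("Release_")
        or parts[1] != "fasta"
    ):
        raise ValueError(f"unrecognised division path: {path!s}")
    division, filename = parts[2], parts[3]
    compressed = filename.endswith(".gz")
    base = filename[:-3] if compressed else filename
    stem, dot, ext = base.rpartition(".")
    # Assumption: the stem looks like gb{division}{N}, e.g. "gbbct1".
    if not dot or not stem.startswith("gb" + division):
        raise ValueError(f"unrecognised division filename: {filename!r}")
    return division, filename, ext, compressed


print(parse_division_path_sketch("Release_270/fasta/bct/gbbct1.fasta.gz"))
```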
output_subdir_for
output_subdir_for(path: Path | str) -> Path
Return the processed_data relative subdir for a species-organised source path.
Combines parse_genome_path and genome_subdir. The accession is always extracted from the path, including for level-0 flat files.
Examples:
output_subdir_for(Path("Homo_sapiens--GCF_000001405.40.gbff.gz")) → Path("Homo_sapiens/GCF_000001405.40")
output_subdir_for(Path("Homo_sapiens/GCF_000001405.40.gbff.gz")) → Path("Homo_sapiens/GCF_000001405.40")
output_subdir_for(Path("Homo_sapiens/GCF_000001405.40/seq.gbff.gz")) → Path("Homo_sapiens/GCF_000001405.40")
scan_species_dir
scan_species_dir(
directory: Path | str,
) -> Iterator[tuple[Path, Path]]
Scan a species-organised directory and yield (absolute_file, subdir) pairs.
subdir is the species/accession path (e.g. Homo_sapiens/GCF_000001405.40), derived from directory structure for level-1 and level-2, and from the filename (using the '--' separator) for level-0.
Three layouts are recognised (relative to directory):
Level 0 — flat file: {Species}--{accession}.ext → subdir = Species/accession
Level 1 — species subdirectory: {Species}/{accession}.ext → subdir = Species/accession
Level 2 — species+accession subdirectories: {Species}/{accession}/*.ext → subdir = Species/accession
Files that cannot be parsed (unknown extension, missing separator at level-0) are silently skipped.
Yields:
| Type | Description |
|---|---|
| tuple[Path, Path] | (absolute_path, subdir) where subdir is a relative Path. |
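The scanning behaviour described above could look roughly like the sketch below, which walks a directory and silently skips anything it cannot parse. All names (`scan_species_dir_sketch`, `_stem_or_none`, `KNOWN_EXTS`) and the `ACC1`/`ACC2` accessions in the demo are hypothetical, and the set of recognised extensions is an assumption.

```python
import tempfile
from pathlib import Path
from typing import Iterator

KNOWN_EXTS = {"gbff", "fasta"}  # assumption: real list not documented here


def _stem_or_none(filename):
    """Stem of 'stem.ext[.gz]', or None when the extension is unknown."""
    base = filename[:-3] if filename.endswith(".gz") else filename
    stem, dot, ext = base.rpartition(".")
    return stem if dot and ext in KNOWN_EXTS else None


def scan_species_dir_sketch(directory) -> Iterator[tuple[Path, Path]]:
    root = Path(directory).resolve()
    for file in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = file.relative_to(root).parts
        stem = _stem_or_none(rel[-1])
        if stem is None:
            continue  # unknown extension: skipped silently
        if len(rel) == 1:  # level 0: split the stem on '--'
            species, sep, accession = stem.partition("--")
            if sep and species and accession:
                yield file, Path(species) / accession
        elif len(rel) == 2:  # level 1: Species/accession.ext
            yield file, Path(rel[0]) / stem
        elif len(rel) == 3:  # level 2: Species/accession/*.ext
            yield file, Path(rel[0]) / rel[1]


# Demo on a throwaway tree covering all three levels plus two skipped files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    for name in [
        "Homo_sapiens--GCF_000001405.40.gbff.gz",
        "Betula_nana/ACC1.fasta",
        "Zea_mays/ACC2/chr1.fasta.gz",
        "notes.txt",      # unknown extension
        "orphan.fasta",   # level 0 without '--'
    ]:
        target = root / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    found = {
        (f.relative_to(root).as_posix(), s.as_posix())
        for f, s in scan_species_dir_sketch(root)
    }
```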