skimindex.sequences
Sequence file discovery utilities.
Provides list_sequence_files() to enumerate sequence files (.fasta, .gbff, …) in a directory. Both compressed (.gz) and uncompressed variants are discovered. Three path modes are available:

- relative: path relative to the given directory, e.g. Homo_sapiens-GCF_000001405.40.gbff.gz
- absolute: full absolute path, e.g. /genbank/Plants/Homo_sapiens-GCF_000001405.40.gbff.gz
- prefixed: directory name prepended to the relative path, e.g. Plants/Homo_sapiens-GCF_000001405.40.gbff.gz
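The relationship between the three path modes can be illustrated with plain pathlib. This is only a sketch of the idea, not the library's implementation: `express_path` is a hypothetical helper introduced here for illustration.

```python
from pathlib import Path


def express_path(file_path, directory, mode="relative"):
    """Hypothetical helper: express file_path in one of the three modes."""
    directory = Path(directory)
    relative = Path(file_path).relative_to(directory)
    if mode == "relative":
        # Path relative to the searched directory.
        return relative
    if mode == "absolute":
        # Full absolute path.
        return directory.resolve() / relative
    if mode == "prefixed":
        # Directory name prepended to the relative path.
        return Path(directory.name) / relative
    raise ValueError(f"unknown mode: {mode!r}")


directory = Path("/genbank/Plants")
found = directory / "Homo_sapiens-GCF_000001405.40.gbff.gz"
for mode in ("relative", "absolute", "prefixed"):
    print(mode, express_path(found, directory, mode))
```

An unknown mode raises ValueError here, mirroring the behaviour documented for the real functions.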
species_list() discovers species under the species/ subdirectory of a genome data directory (raw or processed). The expected layout is:
{directory}/species/{Species_name}/{individual}/...
Example
from skimindex.sequences import list_sequence_files, species_list

files = list_sequence_files("/genbank/Plants", mode="prefixed")
for f in files:
    print(f)  # Plants/Homo_sapiens-GCF_000001405.40.gbff.gz

species = species_list("/raw_data/genomes_15x", mode="absolute")
for name, path in species.items():
    print(name, path)  # Betula_nana /raw_data/genomes_15x/species/Betula_nana
list_sequence_files
list_sequence_files(
directory: PathLike,
mode: PathMode = "relative",
extensions: tuple = SEQUENCE_EXTENSIONS,
recursive: bool = False,
compressed: bool = True,
uncompressed: bool = True,
) -> list[Path]
Return a sorted list of sequence files found in directory.
For each extension (e.g. ".fasta"), both the compressed (".fasta.gz") and uncompressed (".fasta") variants are searched by default. The compressed and uncompressed flags turn each variant on or off.
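The suffix matching described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual implementation: `find_sequence_files` and `EXAMPLE_EXTENSIONS` are hypothetical names, the real contents of SEQUENCE_EXTENSIONS may differ, and only the "relative" path mode is shown.

```python
import tempfile
from pathlib import Path

# Assumed for illustration; the real SEQUENCE_EXTENSIONS may differ.
EXAMPLE_EXTENSIONS = (".fasta", ".gbff")


def find_sequence_files(directory, extensions=EXAMPLE_EXTENSIONS,
                        recursive=False, compressed=True, uncompressed=True):
    """Hypothetical sketch: sorted relative paths of matching files."""
    directory = Path(directory)
    if not directory.is_dir():
        raise FileNotFoundError(directory)
    # Build the accepted suffixes from the base extensions and the two flags.
    suffixes = []
    for ext in extensions:
        if uncompressed:
            suffixes.append(ext)
        if compressed:
            suffixes.append(ext + ".gz")
    candidates = directory.rglob("*") if recursive else directory.iterdir()
    return sorted(
        p.relative_to(directory)
        for p in candidates
        if p.is_file() and p.name.endswith(tuple(suffixes))
    )


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.fasta", "b.fasta.gz", "notes.txt"):
        (Path(tmp) / name).touch()
    print(find_sequence_files(tmp))                    # a.fasta and b.fasta.gz
    print(find_sequence_files(tmp, compressed=False))  # a.fasta only
```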
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| directory | PathLike | Directory to search. | required |
| mode | PathMode | How to express the returned paths: "relative" (relative to directory), "absolute" (full absolute path), or "prefixed" (directory name prepended, e.g. Plants/file.gbff.gz). | 'relative' |
| extensions | tuple | Tuple of base suffixes without .gz (default: SEQUENCE_EXTENSIONS). | SEQUENCE_EXTENSIONS |
| recursive | bool | If True, search subdirectories recursively. | False |
| compressed | bool | Include compressed variants (e.g. .fasta.gz). | True |
| uncompressed | bool | Include uncompressed variants (e.g. .fasta). | True |
Returns:
| Type | Description |
|---|---|
| list[Path] | Sorted list of Path objects. |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If directory does not exist. |
| ValueError | If mode is not one of the three allowed values. |
species_list
species_list(
directory: PathLike, mode: PathMode = "relative"
) -> dict[str, Path]
Return a mapping of species name → species directory for a genome dataset.
Expects the layout:
{directory}/species/{Species_name}/{individual}/...
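Under this layout, species discovery amounts to listing the subdirectories of species/. The sketch below illustrates that idea only: `discover_species` is a hypothetical name, it always returns the joined path (no path modes), and it raises FileNotFoundError when species/ is missing, which may differ from the real function's checks.

```python
import tempfile
from pathlib import Path


def discover_species(directory):
    """Hypothetical sketch: map species name to its directory."""
    species_root = Path(directory) / "species"
    if not species_root.is_dir():
        raise FileNotFoundError(species_root)
    # Each subdirectory of species/ is one species; plain files are ignored.
    # Sorting first keeps the dict in name order.
    return {p.name: p for p in sorted(species_root.iterdir()) if p.is_dir()}


# Small demonstration on a throwaway dataset.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "species" / "Betula_nana" / "individual_1").mkdir(parents=True)
    for name, path in discover_species(tmp).items():
        print(name, path)
```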
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| directory | PathLike | Root of the genome dataset (e.g. …) | required |
| mode | PathMode | How to express the returned paths: … | 'relative' |
Returns:
| Type | Description |
|---|---|
| dict[str, Path] | Sorted dict … directory name (e.g. …) |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If directory does not exist. |
| ValueError | If mode is not one of the three allowed values. |
genome_species_list
genome_species_list(
mode: PathMode = "relative", data_type: DataType = "raw"
) -> dict[str, Path]
Return the species list for the genome dataset, using paths from config.
Reads [genomes] directory and the appropriate base directory (raw_data or processed_data from [local_directories]) from the configuration to build the path, then delegates to species_list().
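The path resolution described here might look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the INI-style config syntax, the `directory` key under [genomes], the example values, the "processed" data type name, the simple join of base directory and genome directory, and the hypothetical helper `resolve_genome_directory`.

```python
import configparser
from pathlib import Path

# Assumed config content, for illustration only.
EXAMPLE_CONFIG = """
[local_directories]
raw_data = /raw_data
processed_data = /processed_data

[genomes]
directory = genomes_15x
"""


def resolve_genome_directory(config_text, data_type="raw"):
    """Hypothetical sketch: join the base directory with the genome directory."""
    # Assumed mapping from data type to the key in [local_directories].
    base_keys = {"raw": "raw_data", "processed": "processed_data"}
    if data_type not in base_keys:
        raise ValueError(f"unknown data_type: {data_type!r}")
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    base = Path(parser["local_directories"][base_keys[data_type]])
    return base / parser["genomes"]["directory"]


print(resolve_genome_directory(EXAMPLE_CONFIG))
print(resolve_genome_directory(EXAMPLE_CONFIG, data_type="processed"))
```

The resolved directory would then be handed to species_list() together with the requested mode.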
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| mode | PathMode | Path mode: … | 'relative' |
| data_type | DataType | Which data tree to inspect: … | 'raw' |
|
Returns:
| Type | Description |
|---|---|
| dict[str, Path] | Sorted dict |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If the resolved genome directory does not exist. |
| ValueError | If mode is invalid. |