skimindex.sequences

Sequence file discovery utilities.

Provides list_sequence_files() to enumerate sequence files (.fasta, .gbff, …) in a directory. Both compressed (.gz) and uncompressed variants are discovered. Three path modes are available:

relative: path relative to the given directory, e.g. Homo_sapiens-GCF_000001405.40.gbff.gz
absolute: full absolute path, e.g. /genbank/Plants/Homo_sapiens-GCF_000001405.40.gbff.gz
prefixed: directory name prepended to the relative path, e.g. Plants/Homo_sapiens-GCF_000001405.40.gbff.gz
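The three modes can be illustrated with plain pathlib. This is a minimal sketch, not the library's implementation: `express_path` is a hypothetical helper introduced only for this example, and it assumes the file already lies under the given directory.

```python
from pathlib import Path


def express_path(path, directory, mode="relative"):
    """Hypothetical helper: express `path` (located under `directory`) in one mode."""
    directory = Path(directory)
    # Part of the path below the searched directory.
    relative = Path(path).relative_to(directory)
    if mode == "relative":
        return relative
    if mode == "absolute":
        return (directory / relative).absolute()
    if mode == "prefixed":
        # Only the last component of the directory is prepended.
        return Path(directory.name) / relative
    raise ValueError(f"unknown mode: {mode!r}")


directory = Path("/genbank/Plants")
found = directory / "Homo_sapiens-GCF_000001405.40.gbff.gz"
for mode in ("relative", "absolute", "prefixed"):
    print(mode, express_path(found, directory, mode))
```

The "prefixed" form keeps just enough context to tell apart files from sibling directories without exposing the full absolute path.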

species_list() discovers species under the species/ subdirectory of a genome data directory (raw or processed). The expected layout is:

{directory}/species/{Species_name}/{individual}/...

Example

from skimindex.sequences import list_sequence_files, species_list

files = list_sequence_files("/genbank/Plants", mode="prefixed")
for f in files:
    print(f)  # Plants/Homo_sapiens-GCF_000001405.40.gbff.gz

species = species_list("/raw_data/genomes_15x", mode="absolute")
for name, path in species.items():
    print(name, path)  # Betula_nana /raw_data/genomes_15x/species/Betula_nana

list_sequence_files

list_sequence_files(
    directory: PathLike,
    mode: PathMode = "relative",
    extensions: tuple = SEQUENCE_EXTENSIONS,
    recursive: bool = False,
    compressed: bool = True,
    uncompressed: bool = True,
) -> list[Path]

Return a sorted list of sequence files found in directory.

For each extension (e.g. ".fasta"), the uncompressed (".fasta") and compressed (".fasta.gz") variants are searched. The compressed and uncompressed flags select which of the two variants are included.
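The interaction between the base extensions and the two flags can be sketched as follows. Both function names are hypothetical, and the extension tuple used here is only illustrative; the real contents of SEQUENCE_EXTENSIONS are not reproduced.

```python
def candidate_suffixes(extensions, compressed=True, uncompressed=True):
    """Hypothetical helper: expand base suffixes into the variants to match."""
    suffixes = []
    for ext in extensions:
        if uncompressed:
            suffixes.append(ext)
        if compressed:
            suffixes.append(ext + ".gz")
    return suffixes


def has_sequence_suffix(filename, suffixes):
    """True if the file name ends with one of the candidate suffixes."""
    return filename.endswith(tuple(suffixes))


# Illustrative extensions only (assumption).
print(candidate_suffixes((".fasta", ".gbff")))
print(candidate_suffixes((".fasta", ".gbff"), compressed=False))
```

With both flags set to False no suffix is produced, so under this sketch nothing would match.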

Parameters:

Name	Type	Description	Default
`directory`	`PathLike`	Directory to search.	required
`mode`	`PathMode`	How to express the returned paths: "relative" (relative to directory), "absolute" (full absolute path), or "prefixed" (directory name prepended, e.g. Plants/file.gbff.gz).	`'relative'`
`extensions`	`tuple`	Base suffixes to match, given without the .gz part (e.g. ".fasta").	`SEQUENCE_EXTENSIONS`
`recursive`	`bool`	If True, search subdirectories recursively.	`False`
`compressed`	`bool`	Include compressed variants (e.g. .fasta.gz).	`True`
`uncompressed`	`bool`	Include uncompressed variants (e.g. .fasta).	`True`

Returns:

Type	Description
`list[Path]`	Sorted list of Path objects.

Raises:

Type	Description
`FileNotFoundError`	if directory does not exist.
`ValueError`	if mode is not one of the three allowed values.

species_list

species_list(
    directory: PathLike, mode: PathMode = "relative"
) -> dict[str, Path]

Return a mapping of species name → species directory for a genome dataset.

Expects the layout:

{directory}/species/{Species_name}/{individual}/...
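To make the layout concrete, the sketch below builds it in a temporary directory and enumerates the species folders with pathlib. It is an illustration under assumptions, not the library's code: `discover_species` is a hypothetical stand-in, the second species name and the individual names are made up, and raising when species/ is missing is a choice made for this example.

```python
import tempfile
from pathlib import Path


def discover_species(directory):
    """Hypothetical stand-in: map species name to its directory under species/."""
    species_root = Path(directory) / "species"
    if not species_root.is_dir():
        raise FileNotFoundError(species_root)
    # Only directories count as species; entries are sorted by name.
    return {p.name: p for p in sorted(species_root.iterdir()) if p.is_dir()}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "genomes_15x"
    # Made-up species and individual names, used only to populate the tree.
    for species, individual in [("Salix_herbacea", "ind_01"), ("Betula_nana", "ind_01")]:
        (root / "species" / species / individual).mkdir(parents=True)
    found = discover_species(root)
    names = list(found)
    print(names)
```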

Parameters:

Name	Type	Description	Default
`directory`	`PathLike`	Root of the genome dataset (e.g. `raw_data/genomes_15x`).	required
`mode`	`PathMode`	How to express the returned paths: `"relative"` (relative to directory), `"absolute"` (full absolute path), or `"prefixed"` (directory name prepended, e.g. `genomes_15x/species/Betula_nana`).	`'relative'`

Returns:

Type	Description
`dict[str, Path]`	Sorted dict `{species_name: path}`, where species_name is the directory name (e.g. `"Betula_nana"`) and path follows mode.

Raises:

Type	Description
`FileNotFoundError`	if directory does not exist.
`ValueError`	if mode is not one of the three allowed values.

genome_species_list

genome_species_list(
    mode: PathMode = "relative", data_type: DataType = "raw"
) -> dict[str, Path]

Return the species list for the genome dataset, using paths from config.

Reads the [genomes] directory setting and the appropriate base directory (raw_data or processed_data from [local_directories]) to build the path, then delegates to species_list().
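The path resolution described above might look roughly like the sketch below. Several things here are assumptions: the configuration is parsed as INI text with configparser (the real file format and loader are not shown in this page), the example values are invented, and joining the base directory with the [genomes] directory is a guess at how the two settings combine. `resolve_genome_directory` is a hypothetical name.

```python
import configparser
from pathlib import Path

# Invented example configuration (assumption: INI-style syntax).
EXAMPLE_CONFIG = """
[local_directories]
raw_data = /raw_data
processed_data = /processed_data

[genomes]
directory = genomes_15x
"""


def resolve_genome_directory(config_text, data_type="raw"):
    """Hypothetical helper: pick the base directory and append the genomes directory."""
    base_keys = {"raw": "raw_data", "processed": "processed_data"}
    if data_type not in base_keys:
        raise ValueError(f"unknown data_type: {data_type!r}")
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    base = Path(parser["local_directories"][base_keys[data_type]])
    return base / parser["genomes"]["directory"]


print(resolve_genome_directory(EXAMPLE_CONFIG, "raw"))
print(resolve_genome_directory(EXAMPLE_CONFIG, "processed"))
```

The resolved directory would then be passed to species_list() together with the requested mode.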

Parameters:

Name	Type	Description	Default
`mode`	`PathMode`	Path mode: `"relative"`, `"absolute"` or `"prefixed"`.	`'relative'`
`data_type`	`DataType`	Which data tree to inspect: `"raw"` (raw sequencing reads, `/raw_data/…`) or `"processed"` (pipeline outputs, `/processed_data/…`).	`'raw'`

Returns:

Type	Description
`dict[str, Path]`	Sorted dict `{species_name: path}` (species names use spaces, not underscores).

Raises:

Type	Description
`FileNotFoundError`	if the resolved genome directory does not exist.
`ValueError`	if mode is invalid.