skimindex.sequences

Sequence file discovery utilities.

Provides list_sequence_files() to enumerate sequence files (.fasta, .gbff, …) in a directory. Both compressed (.gz) and uncompressed variants are discovered. Three path modes are available:

- relative: path relative to the given directory, e.g. Homo_sapiens-GCF_000001405.40.gbff.gz
- absolute: full absolute path, e.g. /genbank/Plants/Homo_sapiens-GCF_000001405.40.gbff.gz
- prefixed: directory name prepended to the relative path, e.g. Plants/Homo_sapiens-GCF_000001405.40.gbff.gz
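The three modes only change how the same file is expressed. As a minimal sketch of that relationship using pathlib (express_path is a hypothetical helper written for illustration, not part of skimindex, and may differ from how the library computes its paths):

```python
from pathlib import Path


def express_path(path: Path, directory: Path, mode: str = "relative") -> Path:
    """Express `path` (located under `directory`) in one of the three modes.

    Hypothetical helper for illustration only; not the skimindex implementation.
    """
    relative = path.relative_to(directory)
    if mode == "relative":
        return relative
    if mode == "absolute":
        return (directory / relative).absolute()
    if mode == "prefixed":
        # Only the last component of the directory is prepended.
        return Path(directory.name) / relative
    raise ValueError(f"unknown mode: {mode!r}")


directory = Path("/genbank/Plants")
found = directory / "Homo_sapiens-GCF_000001405.40.gbff.gz"

for mode in ("relative", "absolute", "prefixed"):
    print(mode, express_path(found, directory, mode))
```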

species_list() discovers species under the species/ subdirectory of a genome data directory (raw or processed). The expected layout is:

{directory}/species/{Species_name}/{individual}/...

Example

from skimindex.sequences import list_sequence_files, species_list

files = list_sequence_files("/genbank/Plants", mode="prefixed")
for f in files:
    print(f)  # Plants/Homo_sapiens-GCF_000001405.40.gbff.gz

species = species_list("/raw_data/genomes_15x", mode="absolute")
for name, path in species.items():
    print(name, path)  # Betula_nana /raw_data/genomes_15x/species/Betula_nana

list_sequence_files

list_sequence_files(
    directory: PathLike,
    mode: PathMode = "relative",
    extensions: tuple = SEQUENCE_EXTENSIONS,
    recursive: bool = False,
    compressed: bool = True,
    uncompressed: bool = True,
) -> list[Path]

Return a sorted list of sequence files found in directory.

For each extension (e.g. ".fasta"), both the compressed (".fasta.gz") and uncompressed (".fasta") variants are searched, controlled by the compressed and uncompressed flags.
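The effect of the two flags on the set of suffixes can be sketched as follows. candidate_suffixes is a hypothetical helper, and the extension tuple passed to it is an illustrative stand-in; the actual contents of SEQUENCE_EXTENSIONS are not listed here.

```python
def candidate_suffixes(extensions, compressed=True, uncompressed=True):
    """Expand base extensions into the suffixes a search would accept.

    Hypothetical helper for illustration; not the skimindex implementation.
    """
    suffixes = []
    for ext in extensions:
        if uncompressed:
            suffixes.append(ext)          # e.g. ".fasta"
        if compressed:
            suffixes.append(ext + ".gz")  # e.g. ".fasta.gz"
    return suffixes


# Illustrative extensions only.
print(candidate_suffixes((".fasta", ".gbff")))
# ['.fasta', '.fasta.gz', '.gbff', '.gbff.gz']
print(candidate_suffixes((".fasta", ".gbff"), uncompressed=False))
# ['.fasta.gz', '.gbff.gz']
```

In this sketch, setting both flags to False yields an empty list, so nothing would match.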

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| directory | PathLike | Directory to search. | required |
| mode | PathMode | How to express the returned paths: "relative" (relative to directory), "absolute" (full absolute path), or "prefixed" (directory name prepended, e.g. Plants/file.gbff.gz). | 'relative' |
| extensions | tuple | Tuple of base suffixes without .gz. | SEQUENCE_EXTENSIONS |
| recursive | bool | If True, search subdirectories recursively. | False |
| compressed | bool | Include compressed variants (e.g. .fasta.gz). | True |
| uncompressed | bool | Include uncompressed variants (e.g. .fasta). | True |

Returns:

| Type | Description |
| --- | --- |
| list[Path] | Sorted list of Path objects. |

Raises:

| Type | Description |
| --- | --- |
| FileNotFoundError | If directory does not exist. |
| ValueError | If mode is not one of the three allowed values. |
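The recursive flag and the FileNotFoundError case can be illustrated with a stand-in built on pathlib and a temporary directory. find_sequence_files, the suffix tuple, and the file names are all hypothetical; this is a sketch of the documented behaviour, not the library's code.

```python
import tempfile
from pathlib import Path


def find_sequence_files(directory, suffixes=(".fasta", ".fasta.gz"), recursive=False):
    """Return sorted paths, relative to `directory`, whose names end in `suffixes`.

    Hypothetical stand-in for illustration; not the skimindex implementation.
    """
    root = Path(directory)
    if not root.is_dir():
        raise FileNotFoundError(f"no such directory: {root}")
    # rglob descends into subdirectories; iterdir stays at the top level.
    walker = root.rglob("*") if recursive else root.iterdir()
    hits = [
        p.relative_to(root)
        for p in walker
        if p.is_file() and p.name.endswith(tuple(suffixes))
    ]
    return sorted(hits)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ("b.fasta", "a.fasta.gz", "notes.txt", "sub/c.fasta"):
        (root / name).touch()
    top_level = find_sequence_files(root)                   # skips sub/c.fasta
    everything = find_sequence_files(root, recursive=True)  # includes it

print(top_level)
print(everything)
```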

species_list

species_list(
    directory: PathLike, mode: PathMode = "relative"
) -> dict[str, Path]

Return a mapping of species name → species directory for a genome dataset.

Expects the layout:

{directory}/species/{Species_name}/{individual}/...
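To make the layout concrete, the sketch below builds a small tree of that shape in a temporary directory and lists the species directories with pathlib. sketch_species_list is a hypothetical stand-in that covers only the "relative" mode, and the species and individual names are invented for the example.

```python
import tempfile
from pathlib import Path


def sketch_species_list(directory):
    """Map species name to its directory, relative to `directory`.

    Hypothetical stand-in for illustration; not the skimindex implementation.
    """
    root = Path(directory)
    if not root.is_dir():
        raise FileNotFoundError(f"no such directory: {root}")
    species_root = root / "species"
    # One entry per immediate subdirectory of species/, in sorted order.
    return {
        p.name: p.relative_to(root)
        for p in sorted(species_root.iterdir())
        if p.is_dir()
    }


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "genomes_15x"
    # Invented species and individual names.
    for species_name, individual in [("Betula_nana", "ind1"), ("Alnus_glutinosa", "ind1")]:
        (root / "species" / species_name / individual).mkdir(parents=True)
    listing = sketch_species_list(root)

for name, path in listing.items():
    print(name, path)
```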

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| directory | PathLike | Root of the genome dataset (e.g. raw_data/genomes_15x). | required |
| mode | PathMode | How to express the returned paths: "relative" (relative to directory), "absolute" (full absolute path), or "prefixed" (directory name prepended, e.g. genomes_15x/species/Betula_nana). | 'relative' |

Returns:

| Type | Description |
| --- | --- |
| dict[str, Path] | Sorted dict {species_name: path} where species_name is the directory name (e.g. "Betula_nana") and path follows mode. |

Raises:

| Type | Description |
| --- | --- |
| FileNotFoundError | If directory does not exist. |
| ValueError | If mode is not one of the three allowed values. |

genome_species_list

genome_species_list(
    mode: PathMode = "relative", data_type: DataType = "raw"
) -> dict[str, Path]

Return the species list for the genome dataset, using paths from config.

Reads [genomes] directory and the appropriate base directory (raw_data or processed_data from [local_directories]) to build the path, then delegates to species_list().
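A rough sketch of that path resolution, assuming the configuration has already been loaded into a nested dict. The dict, its values, and resolve_genome_directory are hypothetical; how skimindex actually loads and stores its configuration is not described here.

```python
from pathlib import PurePosixPath

# Hypothetical configuration values, shaped after the section names above.
config = {
    "local_directories": {"raw_data": "/raw_data", "processed_data": "/processed_data"},
    "genomes": {"directory": "genomes_15x"},
}

# data_type value -> key of the base directory in [local_directories]
BASE_KEY = {"raw": "raw_data", "processed": "processed_data"}


def resolve_genome_directory(config, data_type="raw"):
    """Join the base directory for `data_type` with the [genomes] directory.

    Hypothetical helper for illustration; not the skimindex implementation.
    """
    base = config["local_directories"][BASE_KEY[data_type]]
    return PurePosixPath(base) / config["genomes"]["directory"]


print(resolve_genome_directory(config))               # /raw_data/genomes_15x
print(resolve_genome_directory(config, "processed"))  # /processed_data/genomes_15x
```

The resulting directory is what would then be passed on to species_list().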

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| mode | PathMode | Path mode: "relative", "absolute" or "prefixed". | 'relative' |
| data_type | DataType | Which data tree to inspect: "raw" (raw sequencing reads, /raw_data/…) or "processed" (pipeline outputs, /processed_data/…). | 'raw' |

Returns:

| Type | Description |
| --- | --- |
| dict[str, Path] | Sorted dict {species_name: path} (species names use spaces, not underscores). |

Raises:

| Type | Description |
| --- | --- |
| FileNotFoundError | If the resolved genome directory does not exist. |
| ValueError | If mode is invalid. |