snapatac2.pp.import_fragments#

snapatac2.pp.import_fragments(fragment_file, chrom_sizes, *, is_paired=True, file=None, min_num_fragments=200, sorted_by_barcode=True, whitelist=None, chrM=['chrM', 'M'], chunk_size=2000, tempdir=None, backend='hdf5', n_jobs=8)[source]#

Import fragment files and compute basic QC metrics.

Use this function to load single-cell ATAC fragments into an AnnData object before constructing count matrices. The function stores fragments in .obsm, records reference sequence sizes in .uns, and computes cell-level QC metrics such as fragment counts, duplicate fraction, and mitochondrial fraction.

A fragment refers to the sequence data originating from a distinct location in the genome. In single-ended sequencing, one read equates to a fragment. However, in paired-ended sequencing, a fragment is defined by a pair of reads. This function is designed to handle, store, and process input files with fragment data, further yielding a range of basic Quality Control (QC) metrics. These metrics include the total number of unique fragments, duplication rates, and the percentage of mitochondrial DNA detected.

How fragments are stored is dependent on the sequencing approach utilized. For single-ended sequencing, fragments are found in .obsm['fragment_single']. In contrast, for paired-ended sequencing, they are located in .obsm['fragment_paired'].

Diving deeper technically, the fragments are internally structured within a Compressed Sparse Row (CSR) matrix. In this configuration, each row signifies a specific cell, while each column represents a unique genomic position. Fragment starting positions dictate the column indices. Matrix values capture the lengths of the fragments for paired-end reads or the lengths of the reads for single-ended scenarios. It’s important to note that for single-ended reads, the values are signed, with the sign providing information on the fragment’s strand orientation. Additionally, it is worth noting that cells may harbor duplicate fragments, leading to the presence of duplicate column indices within the matrix. As a result, the matrix deviates from the standard CSR format, and it is not advisable to use the matrix for linear algebra operations.

Anti-Patterns#

Do NOT run linear algebra directly on .obsm['fragment_paired'] or .obsm['fragment_single']; these matrices encode fragment positions and lengths, not a standard feature-by-cell count matrix.
Do NOT set sorted_by_barcode=True for unsorted fragment files; set it to False so this function sorts them first.
Do NOT omit the if __name__ == "__main__" guard in Python scripts when importing a list of files with multiprocessing.

Notes

This function accepts both single-end and paired-end reads. The argument is_paired specifies the type of reads in the fragment file.
When file is not None, this function uses constant memory regardless of the size of the input file.
When sorted_by_barcode is False, this function will sort the fragment file first, during which temporary files will be created in tempdir. The size of temporary files is proportional to the number of records in the fragment file. For large fragment files, it is recommended to set tempdir to a location with sufficient space in order to avoid running out of disk space.
The QC metrics are computed only for reads that are included by the whitelist or chrom_sizes.

Warning

When the input to the function is a list of files, it employs multiprocessing to process these files concurrently. In this case, however, it is crucial to safeguard the entry point of the program by encapsulating the function call within if __name__ == '__main__':. This condition ensures that the module is being run as the main program and not being loaded as a module from another script. Without this protection, each subprocess might attempt to spawn its own subprocesses, leading to a cascade of process spawns—a situation that can cause the program to hang or crash due to infinite recursion. You don’t need to do this in Jupyter notebook as it automatically does that.

type fragment_file:: Path | list[Path]
param fragment_file:: File name of the fragment file, optionally compressed with gzip or zstd. This can be a single file or a list of files. If it is a list of files, a separate AnnData object will be created for each file. A fragment file must contain at least 5 columns: chromosome, start, end, barcode, count. Optionally it can contain one more column indicating the strand of the fragment.
type chrom_sizes:: Genome | dict[str, int]
param chrom_sizes:: A Genome object or a dictionary containing chromosome sizes, for example, {"chr1": 2393, "chr2": 2344, ...}.
type is_paired:: bool
param is_paired:: Indicate whether the fragment file contains paired-end reads.
type file:: Path | list[Path] | None
param file:: File name of the output h5ad file used to store the result. If provided, result will be saved to a backed AnnData, otherwise an in-memory AnnData is used. If fragment_file is a list of files, file must also be a list of files if provided.
type min_num_fragments:: int
param min_num_fragments:: Number of unique fragments threshold used to filter cells
type sorted_by_barcode:: bool
param sorted_by_barcode:: Whether the fragment file has been sorted by cell barcodes. This function will be faster if sorted_by_barcode==True. Note the make_fragment_file will always sort the fragment file by barcode.
type whitelist:: Path | list[str] | None
param whitelist:: File name or a list of barcodes. If it is a file name, each line must contain a valid barcode. When provided, only barcodes in the whitelist will be retained.
type chrM:: list[str]
param chrM:: A list of chromosome names that are considered mitochondrial DNA. This is used to compute the fraction of mitochondrial DNA.
type chunk_size:: int
param chunk_size:: Increasing the chunk_size may speed up I/O but will use more memory. The speed gain is usually not significant.
type tempdir:: Path | None
param tempdir:: Location to store temporary files. If None, system temporary directory will be used.
type backend:: Literal['hdf5']
param backend:: The backend.
type n_jobs:: int
param n_jobs:: Number of jobs to run in parallel when fragment_file is a list. If n_jobs=-1, all CPUs will be used.
returns:: An annotated data matrix of shape n_obs x n_vars. Rows correspond to cells and columns to regions. If file=None, an in-memory AnnData will be returned, otherwise a backed AnnData is returned.
rtype:: AnnData