snapatac2.pp.import_fragments#
- snapatac2.pp.import_fragments(fragment_file, chrom_sizes, *, is_paired=True, file=None, min_num_fragments=200, sorted_by_barcode=True, whitelist=None, chrM=['chrM', 'M'], chunk_size=2000, tempdir=None, backend='hdf5', n_jobs=8)[source]#
Import fragment files and compute basic QC metrics.
Use this function to load single-cell ATAC fragments into an AnnData object before constructing count matrices. The function stores fragments in
.obsm, records reference sequence sizes in.uns, and computes cell-level QC metrics such as fragment counts, duplicate fraction, and mitochondrial fraction.A fragment refers to the sequence data originating from a distinct location in the genome. In single-ended sequencing, one read equates to a fragment. However, in paired-ended sequencing, a fragment is defined by a pair of reads. This function is designed to handle, store, and process input files with fragment data, further yielding a range of basic Quality Control (QC) metrics. These metrics include the total number of unique fragments, duplication rates, and the percentage of mitochondrial DNA detected.
How fragments are stored is dependent on the sequencing approach utilized. For single-ended sequencing, fragments are found in
.obsm['fragment_single']. In contrast, for paired-ended sequencing, they are located in.obsm['fragment_paired'].Diving deeper technically, the fragments are internally structured within a Compressed Sparse Row (CSR) matrix. In this configuration, each row signifies a specific cell, while each column represents a unique genomic position. Fragment starting positions dictate the column indices. Matrix values capture the lengths of the fragments for paired-end reads or the lengths of the reads for single-ended scenarios. It’s important to note that for single-ended reads, the values are signed, with the sign providing information on the fragment’s strand orientation. Additionally, it is worth noting that cells may harbor duplicate fragments, leading to the presence of duplicate column indices within the matrix. As a result, the matrix deviates from the standard CSR format, and it is not advisable to use the matrix for linear algebra operations.
Anti-Patterns#
Do NOT run linear algebra directly on
.obsm['fragment_paired']or.obsm['fragment_single']; these matrices encode fragment positions and lengths, not a standard feature-by-cell count matrix.Do NOT set
sorted_by_barcode=Truefor unsorted fragment files; set it toFalseso this function sorts them first.Do NOT omit the
if __name__ == "__main__"guard in Python scripts when importing a list of files with multiprocessing.
Notes
This function accepts both single-end and paired-end reads. The argument
is_pairedspecifies the type of reads in the fragment file.When
fileis notNone, this function uses constant memory regardless of the size of the input file.When
sorted_by_barcodeisFalse, this function will sort the fragment file first, during which temporary files will be created intempdir. The size of temporary files is proportional to the number of records in the fragment file. For large fragment files, it is recommended to settempdirto a location with sufficient space in order to avoid running out of disk space.The QC metrics are computed only for reads that are included by the
whitelistorchrom_sizes.
Warning
When the input to the function is a list of files, it employs multiprocessing to process these files concurrently. In this case, however, it is crucial to safeguard the entry point of the program by encapsulating the function call within
if __name__ == '__main__':. This condition ensures that the module is being run as the main program and not being loaded as a module from another script. Without this protection, each subprocess might attempt to spawn its own subprocesses, leading to a cascade of process spawns—a situation that can cause the program to hang or crash due to infinite recursion. You don’t need to do this in Jupyter notebook as it automatically does that.- type fragment_file:
- param fragment_file:
File name of the fragment file, optionally compressed with gzip or zstd. This can be a single file or a list of files. If it is a list of files, a separate AnnData object will be created for each file. A fragment file must contain at least 5 columns: chromosome, start, end, barcode, count. Optionally it can contain one more column indicating the strand of the fragment.
- type chrom_sizes:
- param chrom_sizes:
A Genome object or a dictionary containing chromosome sizes, for example,
{"chr1": 2393, "chr2": 2344, ...}.- type is_paired:
- param is_paired:
Indicate whether the fragment file contains paired-end reads.
- type file:
- param file:
File name of the output h5ad file used to store the result. If provided, result will be saved to a backed AnnData, otherwise an in-memory AnnData is used. If
fragment_fileis a list of files,filemust also be a list of files if provided.- type min_num_fragments:
- param min_num_fragments:
Number of unique fragments threshold used to filter cells
- type sorted_by_barcode:
- param sorted_by_barcode:
Whether the fragment file has been sorted by cell barcodes. This function will be faster if
sorted_by_barcode==True. Note themake_fragment_filewill always sort the fragment file by barcode.- type whitelist:
- param whitelist:
File name or a list of barcodes. If it is a file name, each line must contain a valid barcode. When provided, only barcodes in the whitelist will be retained.
- type chrM:
- param chrM:
A list of chromosome names that are considered mitochondrial DNA. This is used to compute the fraction of mitochondrial DNA.
- type chunk_size:
- param chunk_size:
Increasing the chunk_size may speed up I/O but will use more memory. The speed gain is usually not significant.
- type tempdir:
- param tempdir:
Location to store temporary files. If
None, system temporary directory will be used.- type backend:
Literal['hdf5']- param backend:
The backend.
- type n_jobs:
- param n_jobs:
Number of jobs to run in parallel when
fragment_fileis a list. Ifn_jobs=-1, all CPUs will be used.- returns:
An annotated data matrix of shape
n_obsxn_vars. Rows correspond to cells and columns to regions. Iffile=None, an in-memory AnnData will be returned, otherwise a backed AnnData is returned.- rtype:
AnnData
See also
Examples
>>> import snapatac2 as snap >>> fragments = snap.datasets.pbmc500(downsample=True) >>> data = snap.pp.import_fragments( ... fragments, ... chrom_sizes=snap.genome.hg38, ... sorted_by_barcode=False, ... ) >>> {"n_fragment", "frac_dup", "frac_mito"}.issubset(data.obs.columns) True