snapatac2.pp.import_data
- snapatac2.pp.import_data(fragment_file, *, file=None, genome=None, gene_anno=None, chrom_size=None, min_num_fragments=200, min_tsse=1, sorted_by_barcode=True, low_memory=True, whitelist=None, shift_left=0, shift_right=0, chunk_size=2000, tempdir=None, backend='hdf5', n_jobs=8)
Import dataset and compute QC metrics.
This function stores fragments as base-resolution Tn5 insertions in the resulting h5ad file (in .obsm['insertion']), along with the chromosome sizes (in .uns['reference_sequences']). Various QC metrics are computed, including TSSe, the number of unique fragments, the duplication rate, and the fraction of mitochondrial DNA reads. The .obsm['insertion'] matrix created in this step is essential for downstream analysis, such as tile matrix generation and peak calling.
- Parameters:
fragment_file (Path | list[Path]) – File name of the fragment file. This can be a single file or a list of files.
file (Union[Path, list[Path], None]) – File name of the output h5ad file used to store the result. If provided, the result is saved to a backed AnnData; otherwise an in-memory AnnData is used. If fragment_file is a list of files, file must also be a list of files if provided.
genome (Optional[Genome]) – A Genome object providing gene annotation and chromosome sizes. If not set, gene_anno and chrom_size must be provided. genome has lower priority than gene_anno and chrom_size.
gene_anno (Optional[Path]) – File name of the gene annotation file in GFF or GTF format. This is required if genome is not set. Setting gene_anno overrides the annotations from the genome parameter.
chrom_size (Optional[dict[str, int]]) – A dictionary containing chromosome sizes, for example, {"chr1": 2393, "chr2": 2344, ...}. This is required if genome is not set. Setting chrom_size overrides the chromosome sizes from the genome parameter.
min_num_fragments (int) – Threshold on the number of unique fragments, used to filter cells.
min_tsse (float) – TSS enrichment threshold, used to filter cells.
sorted_by_barcode (bool) – Whether the fragment file has been sorted by cell barcode. If sorted_by_barcode == True, this function uses a small, fixed amount of memory. If sorted_by_barcode == False and low_memory == False, all data will be kept in memory. See low_memory for more details.
low_memory (bool) – Whether to use the low-memory mode when sorted_by_barcode == False. In this mode the records are first sorted by barcode and then processed in batches. The parameter has no effect when sorted_by_barcode == True.
whitelist (Union[Path, list[str], None]) – File name or a list of barcodes. If it is a file name, each line must contain a valid barcode. When provided, only barcodes in the whitelist are retained.
shift_left (int) – Insertion site correction for the left end. This is set to 0 by default, as shift correction is usually done in the fragment file generation step.
shift_right (int) – Insertion site correction for the right end. Note that this has no effect on single-end reads; for single-end reads, shift_right is set using the value of shift_left. This is set to 0 by default, as shift correction is usually done in the fragment file generation step.
chunk_size (int) – Increasing chunk_size may speed up I/O but will use more memory. The speed gain is usually not significant.
tempdir (Optional[Path]) – Location to store temporary files. If None, the system temporary directory is used.
backend (Literal['hdf5']) – The backend.
n_jobs (int) – Number of jobs to run in parallel when fragment_file is a list. If n_jobs=-1, all CPUs will be used.
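The interplay between sorted_by_barcode, low_memory, and whitelist can be illustrated with a small pure-Python sketch. It does not call SnapATAC2 and is not the library's implementation: the function names group_sorted and group_unsorted and the toy records are assumptions made for illustration. When records are already grouped by barcode, each cell can be streamed with only one cell in memory; otherwise the records must be sorted by barcode first (the sketch simply sorts in memory).

```python
from itertools import groupby
from operator import itemgetter

# Toy fragment records: (chrom, start, end, barcode). Hypothetical data.
FRAGMENTS = [
    ("chr1", 100, 250, "AAAC"),
    ("chr1", 300, 480, "TTTG"),
    ("chr2", 50, 210, "AAAC"),
    ("chr1", 700, 900, "GGGA"),
]


def group_sorted(records, whitelist=None):
    """Yield (barcode, fragments) from records already grouped by barcode.

    Only one cell's fragments are held in memory at a time.
    """
    for barcode, group in groupby(records, key=itemgetter(3)):
        # Drop barcodes that are not in the whitelist, if one is given.
        if whitelist is not None and barcode not in whitelist:
            continue
        yield barcode, list(group)


def group_unsorted(records, whitelist=None):
    """Sort records by barcode first, then reuse the streaming path."""
    yield from group_sorted(sorted(records, key=itemgetter(3)), whitelist)


cells = dict(group_unsorted(FRAGMENTS, whitelist={"AAAC", "TTTG"}))
print({barcode: len(frags) for barcode, frags in cells.items()})
```

Passing the unsorted toy records straight to group_sorted would split the "AAAC" cell into two separate groups, which is why the sorted_by_barcode flag has to match the actual ordering of the input.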
- Returns:
An annotated data matrix of shape n_obs x n_vars. Rows correspond to cells and columns to regions. If file=None, an in-memory AnnData is returned; otherwise a backed AnnData is returned.
- Return type:
AnnData | ad.AnnData
Examples
>>> import snapatac2 as snap
>>> data = snap.pp.import_data(snap.datasets.pbmc500(), genome=snap.genome.hg38, sorted_by_barcode=False)
>>> print(data)
AnnData object with n_obs × n_vars = 816 × 0
    obs: 'tsse', 'n_fragment', 'frac_dup', 'frac_mito'
    uns: 'reference_sequences'
    obsm: 'insertion'
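The obs columns in the output above are per-cell QC metrics. As a rough illustration of what n_fragment, frac_dup, and frac_mito summarize, the sketch below derives comparable quantities from toy fragment records in plain Python. The definitions used here are assumptions for illustration and may differ from how SnapATAC2 computes them: duplicates are counted from a per-fragment read count column, and mitochondrial fragments are identified by the chromosome name "chrM". TSSe is omitted because it requires gene annotations. The function name qc_metrics is hypothetical.

```python
from collections import defaultdict

# Toy records: (chrom, start, end, barcode, read_count). Hypothetical data;
# read_count is the number of reads supporting the same unique fragment.
FRAGMENTS = [
    ("chr1", 100, 250, "AAAC", 2),
    ("chr2", 50, 210, "AAAC", 1),
    ("chrM", 10, 90, "AAAC", 1),
    ("chr1", 300, 480, "TTTG", 4),
]


def qc_metrics(records, mito_chrom="chrM"):
    """Return {barcode: metrics} under the assumed definitions above."""
    unique = defaultdict(int)  # unique fragments per barcode
    total = defaultdict(int)   # total reads per barcode, duplicates included
    mito = defaultdict(int)    # unique mitochondrial fragments per barcode
    for chrom, _start, _end, barcode, count in records:
        unique[barcode] += 1
        total[barcode] += count
        if chrom == mito_chrom:
            mito[barcode] += 1
    return {
        bc: {
            "n_fragment": unique[bc],
            "frac_dup": 1 - unique[bc] / total[bc],
            "frac_mito": mito[bc] / unique[bc],
        }
        for bc in unique
    }


metrics = qc_metrics(FRAGMENTS)
for barcode, values in metrics.items():
    print(barcode, values)
```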