snapatac2.pp.make_gene_matrix#
- snapatac2.pp.make_gene_matrix(adata, gene_anno, *, inplace=False, file=None, backend='hdf5', chunk_size=500, use_x=False, id_type='gene', upstream=2000, downstream=0, include_gene_body=True, transcript_name_key='transcript_name', transcript_id_key='transcript_id', gene_name_key='gene_name', gene_id_key='gene_id', min_frag_size=None, max_frag_size=None, counting_strategy='paired-insertion')[source]#
Generate a cell-by-gene activity matrix.
Use this function after
import_fragmentsto summarize chromatin accessibility over each gene’s regulatory domain. The regulatory domain is defined from the TSS, or from the full gene body wheninclude_gene_body=True, then extended byupstreamanddownstreambase pairs.Anti-Patterns#
Do NOT call this function on an AnnData object that was not created by
import_fragments; the input must contain fragment or insertion storage.Do NOT set
use_x=Trueunless.Xalready contains the binned or peak counts that should be reused for gene aggregation.Do NOT expect
fileorbackendto affect output storage wheninplace=True; these arguments are only used wheninplace=False.Do NOT change annotation key names unless the GFF/GTF file uses different attribute names than the defaults.
- type adata:
AnnData|AnnDataSet- param adata:
The imported AnnData object, or AnnDataSet, containing per-cell fragment or insertion storage.
- type gene_anno:
- param gene_anno:
Genome object or path to a gene annotation file in GFF or GTF format.
- type inplace:
- param inplace:
If
True, store the gene activity matrix inadata.Xand returnNone. IfFalse, return a new AnnData object containing the gene activity matrix.- type file:
- param file:
Output file for the returned AnnData object when
inplace=False. If provided, the result is stored as backed AnnData. IfNone, the result is returned in memory. This argument has no effect wheninplace=True.- type backend:
- param backend:
Backend used for backed output when
fileis provided.- type chunk_size:
- param chunk_size:
Number of genes processed per chunk. Increase this value to improve I/O throughput when memory is sufficient; decrease it to reduce peak memory use.
- type use_x:
- param use_x:
If
True, use the matrix stored in.Xto compute gene activity. IfFalse, use the imported fragment or insertion storage.- type id_type:
Literal['gene','transcript']- param id_type:
Feature identifier to aggregate by. Use “gene” to aggregate transcripts into genes, or “transcript” to keep transcript-level entries.
- type upstream:
- param upstream:
Number of base pairs to extend upstream of the regulatory domain.
- type downstream:
- param downstream:
Number of base pairs to extend downstream of the regulatory domain.
- type include_gene_body:
- param include_gene_body:
If
True, include the full gene body before extension. IfFalse, use the TSS as the regulatory domain before extension.- type transcript_name_key:
- param transcript_name_key:
Attribute key for transcript names in the gene annotation file.
- type transcript_id_key:
- param transcript_id_key:
Attribute key for transcript IDs in the gene annotation file.
- type gene_name_key:
- param gene_name_key:
Attribute key for gene names in the gene annotation file.
- type gene_id_key:
- param gene_id_key:
Attribute key for gene IDs in the gene annotation file.
- type min_frag_size:
- param min_frag_size:
Minimum fragment size to include. Fragments shorter than this threshold are ignored. Use
Noneto disable the lower bound.- type max_frag_size:
- param max_frag_size:
Maximum fragment size to include. Fragments longer than this threshold are ignored. Use
Noneto disable the upper bound.- type counting_strategy:
Literal['fragment','insertion','paired-insertion']- param counting_strategy:
The strategy to compute feature counts. It must be one of the following: “fragment”, “insertion”, or “paired-insertion”. “fragment” means the feature counts are assigned based on the number of fragments that overlap with a region of interest. “insertion” means the feature counts are assigned based on the number of insertions that overlap with a region of interest. “paired-insertion” is similar to “insertion”, but it only counts the insertions once if the pair of insertions of a fragment are both within the same region of interest [Miao24]. Note that this parameter has no effect if input are single-end reads.
- returns:
If
inplace=False, returns an annotated data matrix whose rows are cells and columns are genes or transcripts. Iffile=None, returns an in-memory AnnData object; otherwise returns a backed AnnData object. Ifinplace=True, returnsNoneand updatesadata.Xin place.- rtype:
AnnData
See also
Examples
>>> import snapatac2 as snap >>> fragments = snap.datasets.pbmc500(downsample=True) >>> data = snap.pp.import_fragments( ... fragments, ... chrom_sizes=snap.genome.hg38, ... sorted_by_barcode=False, ... ) >>> gene_mat = snap.pp.make_gene_matrix(data, gene_anno=snap.genome.hg38) >>> print(gene_mat.shape) >>> promoter_mat = snap.pp.make_gene_matrix( ... data, ... gene_anno=snap.genome.hg38, ... upstream=1000, ... downstream=1000, ... include_gene_body=False, ... )