snapatac2.pp.make_gene_matrix#

snapatac2.pp.make_gene_matrix(adata, gene_anno, *, inplace=False, file=None, backend='hdf5', chunk_size=500, use_x=False, id_type='gene', upstream=2000, downstream=0, include_gene_body=True, transcript_name_key='transcript_name', transcript_id_key='transcript_id', gene_name_key='gene_name', gene_id_key='gene_id', min_frag_size=None, max_frag_size=None, counting_strategy='paired-insertion')[source]#

Generate a cell-by-gene activity matrix.

Use this function after import_fragments to summarize chromatin accessibility over each gene’s regulatory domain. The regulatory domain is defined from the TSS, or from the full gene body when include_gene_body=True, then extended by upstream and downstream base pairs.

Anti-Patterns#

Do NOT call this function on an AnnData object that was not created by import_fragments; the input must contain fragment or insertion storage.
Do NOT set use_x=True unless .X already contains the binned or peak counts that should be reused for gene aggregation.
Do NOT expect file or backend to affect output storage when inplace=True; these arguments are only used when inplace=False.
Do NOT change annotation key names unless the GFF/GTF file uses different attribute names than the defaults.

type adata:: AnnData | AnnDataSet
param adata:: The imported AnnData object, or AnnDataSet, containing per-cell fragment or insertion storage.
type gene_anno:: Genome | Path
param gene_anno:: Genome object or path to a gene annotation file in GFF or GTF format.
type inplace:: bool
param inplace:: If True, store the gene activity matrix in adata.X and return None. If False, return a new AnnData object containing the gene activity matrix.
type file:: Path | None
param file:: Output file for the returned AnnData object when inplace=False. If provided, the result is stored as backed AnnData. If None, the result is returned in memory. This argument has no effect when inplace=True.
type backend:: Optional[Literal['hdf5']]
param backend:: Backend used for backed output when file is provided.
type chunk_size:: int
param chunk_size:: Number of genes processed per chunk. Increase this value to improve I/O throughput when memory is sufficient; decrease it to reduce peak memory use.
type use_x:: bool
param use_x:: If True, use the matrix stored in .X to compute gene activity. If False, use the imported fragment or insertion storage.
type id_type:: Literal['gene', 'transcript']
param id_type:: Feature identifier to aggregate by. Use “gene” to aggregate transcripts into genes, or “transcript” to keep transcript-level entries.
type upstream:: int
param upstream:: Number of base pairs to extend upstream of the regulatory domain.
type downstream:: int
param downstream:: Number of base pairs to extend downstream of the regulatory domain.
type include_gene_body:: bool
param include_gene_body:: If True, include the full gene body before extension. If False, use the TSS as the regulatory domain before extension.
type transcript_name_key:: str
param transcript_name_key:: Attribute key for transcript names in the gene annotation file.
type transcript_id_key:: str
param transcript_id_key:: Attribute key for transcript IDs in the gene annotation file.
type gene_name_key:: str
param gene_name_key:: Attribute key for gene names in the gene annotation file.
type gene_id_key:: str
param gene_id_key:: Attribute key for gene IDs in the gene annotation file.
type min_frag_size:: int | None
param min_frag_size:: Minimum fragment size to include. Fragments shorter than this threshold are ignored. Use None to disable the lower bound.
type max_frag_size:: int | None
param max_frag_size:: Maximum fragment size to include. Fragments longer than this threshold are ignored. Use None to disable the upper bound.
type counting_strategy:: Literal['fragment', 'insertion', 'paired-insertion']
param counting_strategy:: The strategy to compute feature counts. It must be one of the following: “fragment”, “insertion”, or “paired-insertion”. “fragment” means the feature counts are assigned based on the number of fragments that overlap with a region of interest. “insertion” means the feature counts are assigned based on the number of insertions that overlap with a region of interest. “paired-insertion” is similar to “insertion”, but it only counts the insertions once if the pair of insertions of a fragment are both within the same region of interest [Miao24]. Note that this parameter has no effect if input are single-end reads.
returns:: If inplace=False, returns an annotated data matrix whose rows are cells and columns are genes or transcripts. If file=None, returns an in-memory AnnData object; otherwise returns a backed AnnData object. If inplace=True, returns None and updates adata.X in place.
rtype:: AnnData