snapatac2.pp.make_gene_matrix#

snapatac2.pp.make_gene_matrix(adata, gene_anno, *, inplace=False, file=None, backend='hdf5', chunk_size=500, use_x=False, id_type='gene', upstream=2000, downstream=0, include_gene_body=True, transcript_name_key='transcript_name', transcript_id_key='transcript_id', gene_name_key='gene_name', gene_id_key='gene_id', min_frag_size=None, max_frag_size=None, counting_strategy='paired-insertion')[source]#

Generate a cell-by-gene activity matrix.

Use this function after import_fragments to summarize chromatin accessibility over each gene’s regulatory domain. The regulatory domain is defined from the TSS, or from the full gene body when include_gene_body=True, then extended by upstream and downstream base pairs.

Anti-Patterns#

  • Do NOT call this function on an AnnData object that was not created by import_fragments; the input must contain fragment or insertion storage.

  • Do NOT set use_x=True unless .X already contains the binned or peak counts that should be reused for gene aggregation.

  • Do NOT expect file or backend to affect output storage when inplace=True; these arguments are only used when inplace=False.

  • Do NOT change annotation key names unless the GFF/GTF file uses different attribute names than the defaults.

type adata:

AnnData | AnnDataSet

param adata:

The imported AnnData object, or AnnDataSet, containing per-cell fragment or insertion storage.

type gene_anno:

Genome | Path

param gene_anno:

Genome object or path to a gene annotation file in GFF or GTF format.

type inplace:

bool

param inplace:

If True, store the gene activity matrix in adata.X and return None. If False, return a new AnnData object containing the gene activity matrix.

type file:

Path | None

param file:

Output file for the returned AnnData object when inplace=False. If provided, the result is stored as backed AnnData. If None, the result is returned in memory. This argument has no effect when inplace=True.

type backend:

Optional[Literal['hdf5']]

param backend:

Backend used for backed output when file is provided.

type chunk_size:

int

param chunk_size:

Number of genes processed per chunk. Increase this value to improve I/O throughput when memory is sufficient; decrease it to reduce peak memory use.

type use_x:

bool

param use_x:

If True, use the matrix stored in .X to compute gene activity. If False, use the imported fragment or insertion storage.

type id_type:

Literal['gene', 'transcript']

param id_type:

Feature identifier to aggregate by. Use “gene” to aggregate transcripts into genes, or “transcript” to keep transcript-level entries.

type upstream:

int

param upstream:

Number of base pairs to extend upstream of the regulatory domain.

type downstream:

int

param downstream:

Number of base pairs to extend downstream of the regulatory domain.

type include_gene_body:

bool

param include_gene_body:

If True, include the full gene body before extension. If False, use the TSS as the regulatory domain before extension.

type transcript_name_key:

str

param transcript_name_key:

Attribute key for transcript names in the gene annotation file.

type transcript_id_key:

str

param transcript_id_key:

Attribute key for transcript IDs in the gene annotation file.

type gene_name_key:

str

param gene_name_key:

Attribute key for gene names in the gene annotation file.

type gene_id_key:

str

param gene_id_key:

Attribute key for gene IDs in the gene annotation file.

type min_frag_size:

int | None

param min_frag_size:

Minimum fragment size to include. Fragments shorter than this threshold are ignored. Use None to disable the lower bound.

type max_frag_size:

int | None

param max_frag_size:

Maximum fragment size to include. Fragments longer than this threshold are ignored. Use None to disable the upper bound.

type counting_strategy:

Literal['fragment', 'insertion', 'paired-insertion']

param counting_strategy:

The strategy to compute feature counts. It must be one of the following: “fragment”, “insertion”, or “paired-insertion”. “fragment” means the feature counts are assigned based on the number of fragments that overlap with a region of interest. “insertion” means the feature counts are assigned based on the number of insertions that overlap with a region of interest. “paired-insertion” is similar to “insertion”, but it only counts the insertions once if the pair of insertions of a fragment are both within the same region of interest [Miao24]. Note that this parameter has no effect if input are single-end reads.

returns:

If inplace=False, returns an annotated data matrix whose rows are cells and columns are genes or transcripts. If file=None, returns an in-memory AnnData object; otherwise returns a backed AnnData object. If inplace=True, returns None and updates adata.X in place.

rtype:

AnnData

Examples

>>> import snapatac2 as snap
>>> fragments = snap.datasets.pbmc500(downsample=True)
>>> data = snap.pp.import_fragments(
...     fragments,
...     chrom_sizes=snap.genome.hg38,
...     sorted_by_barcode=False,
... )
>>> gene_mat = snap.pp.make_gene_matrix(data, gene_anno=snap.genome.hg38)
>>> print(gene_mat.shape)
>>> promoter_mat = snap.pp.make_gene_matrix(
...     data,
...     gene_anno=snap.genome.hg38,
...     upstream=1000,
...     downstream=1000,
...     include_gene_body=False,
... )