snapatac2.pp.select_features

snapatac2.pp.select_features(adata, n_features=500000, filter_lower_quantile=0.005, filter_upper_quantile=0.005, whitelist=None, blacklist=None, max_iter=1, inplace=True, n_jobs=8, verbose=True)

Select informative genomic features for downstream analysis.

Use this function after generating a tile, peak, or other count matrix to mark the features that should be used for dimensionality reduction and graph construction. With the default max_iter=1, lower- and upper-quantile filtering is applied first, and the remaining features with the highest total accessibility across all cells are selected, up to n_features.

Notes

  • This function does not subset the matrix. It stores a boolean mask in .var["selected"] when inplace=True, or returns the mask when inplace=False. Downstream functions use this mask to generate submatrices on the fly. Features that are zero in all cells are always removed. For more discussion, see scverse/SnapATAC2#116.

  • How to set n_features: This value depends on the number of features in the input matrix. It is generally recommended to set n_features to a large value (10% to 50% of the total features) to retain enough features for downstream analysis.
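As a rough illustration of the default count-based mode (max_iter=1), the sketch below builds a boolean mask from per-feature total counts in plain Python. It is not SnapATAC2's implementation: the helper name `sketch_select_features`, the rank-based handling of the quantiles, and the tie-breaking are assumptions made for illustration. It only mirrors the ideas stated above: all-zero features are dropped, the extremes are trimmed, at most n_features of the remaining features are kept, and the result is a mask with one entry per feature rather than a subset matrix.

```python
import math

def sketch_select_features(total_counts, n_features,
                           lower_quantile=0.005, upper_quantile=0.005):
    """Return a boolean mask over features (hypothetical helper, not library code)."""
    # Features that are zero in all cells are never selected.
    nonzero = [i for i, c in enumerate(total_counts) if c > 0]
    # Rank the nonzero features from least to most accessible.
    ranked = sorted(nonzero, key=lambda i: total_counts[i])
    # Assumption: trim a fixed fraction of the ranked features from each end.
    n_low = math.floor(len(ranked) * lower_quantile)
    n_high = math.floor(len(ranked) * upper_quantile)
    candidates = ranked[n_low:len(ranked) - n_high]
    # Keep the most accessible candidates, up to n_features.
    keep = set(candidates[-n_features:]) if n_features > 0 else set()
    # The mask has the same length as the input; nothing is subset.
    return [i in keep for i in range(len(total_counts))]

counts = [0, 5, 1, 9, 100, 7, 3, 0, 8, 6]
mask = sketch_select_features(counts, n_features=3,
                              lower_quantile=0.125, upper_quantile=0.125)
print(mask)       # one entry per input feature
print(sum(mask))  # number of selected features
```

In this toy input the two all-zero features, the single least accessible feature, and the single most accessible feature are excluded before the top three remaining features are marked. If n_features exceeds the number of candidates, fewer features are marked, which matches the note that the final number can be smaller than requested.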

Anti-Patterns

  • Do NOT expect this function to reduce adata.shape; it only creates or returns a feature mask.

  • Do NOT set max_iter > 1 unless iterative clustering-based feature selection is explicitly required; this mode is slower and is not generally recommended.

  • Do NOT raise filter_upper_quantile well above the default of 0.005 on datasets with many features unless highly accessible features should be removed aggressively; the number of features removed scales with the total feature count.

  • Do NOT assume blacklist overrides whitelist. If a feature appears in both, the whitelist keeps it.

  • Do NOT set n_features too small; the spectral embedding used in this package usually benefits from a large number of features.
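The whitelist and blacklist precedence noted above can be sketched with simple interval-overlap checks. This is a minimal illustration under stated assumptions, not the library's code: the helper names `overlaps` and `apply_lists`, and the representation of BED regions as half-open (chrom, start, end) tuples, are hypothetical.

```python
def overlaps(a, b):
    """True if two half-open (chrom, start, end) regions overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def apply_lists(features, total_counts, passed_filters,
                whitelist=(), blacklist=()):
    """Combine a filter result with whitelist/blacklist regions (hypothetical helper)."""
    mask = []
    for region, count, passed in zip(features, total_counts, passed_filters):
        in_white = any(overlaps(region, w) for w in whitelist)
        in_black = any(overlaps(region, b) for b in blacklist)
        if count == 0:
            keep = False   # all-zero features are always removed
        elif in_white:
            keep = True    # whitelist wins, even over the blacklist
        elif in_black:
            keep = False   # blacklisted and not rescued by the whitelist
        else:
            keep = passed  # otherwise defer to the other filters
        mask.append(keep)
    return mask

features = [("chr1", 0, 500), ("chr1", 500, 1000),
            ("chr1", 1000, 1500), ("chr2", 0, 500)]
counts = [12, 3, 0, 40]
passed = [True, False, False, True]
white = [("chr1", 400, 1200)]
black = [("chr1", 0, 500), ("chr2", 100, 200)]
print(apply_lists(features, counts, passed, white, black))
```

Here the first feature is in both lists and is kept, the second is kept by the whitelist although it failed the other filters, the third overlaps the whitelist but has zero counts and is removed, and the fourth is removed by the blacklist.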

type adata:

AnnData | AnnDataSet | list[AnnData]

param adata:

AnnData, AnnDataSet, or list of AnnData objects containing a count matrix in .X. If a list is provided, feature selection is applied to each object in parallel.

type n_features:

int

param n_features:

Maximum number of features to keep. The final number can be smaller if too few features pass filtering or have nonzero counts.

type filter_lower_quantile:

float

param filter_lower_quantile:

Lower quantile of the feature-count distribution to remove. For example, 0.005 removes the bottom 0.5% of features by total count.

type filter_upper_quantile:

float

param filter_upper_quantile:

Upper quantile of the feature-count distribution to remove. For example, 0.005 removes the top 0.5% of features by total count. Because this is a fraction, the absolute number of features removed grows with the size of the matrix: at 0.005, roughly 5,000 of every 1,000,000 features are removed.

type whitelist:

Path | None

param whitelist:

BED file containing regions to keep. Nonzero features overlapping these regions are kept regardless of other filtering criteria. If a feature is present in both whitelist and blacklist, it is kept.

type blacklist:

Path | None

param blacklist:

BED file containing regions to remove. Features overlapping these regions are removed unless they are also kept by whitelist.

type max_iter:

int

param max_iter:

Number of feature-selection iterations. Use 1 for count-based feature selection. Values greater than 1 perform iterative clustering and feature selection based on variable features found from previous clustering results. This is similar to ArchR but is not generally recommended; see scverse/SnapATAC2#111.

type inplace:

bool

param inplace:

If True, store the boolean mask in adata.var["selected"] and return None. If False, return the mask without modifying adata.

type n_jobs:

int

param n_jobs:

Number of parallel jobs to use when adata is a list.

type verbose:

bool

param verbose:

Whether to print progress messages.

returns:

If inplace=False, returns a boolean feature mask where True means the feature is kept and False means the feature is removed. If adata is a list, returns one mask per object. If inplace=True, returns None and stores the mask in .var["selected"].

rtype:

ndarray | list[ndarray] | None

Examples

>>> import snapatac2 as snap
>>> fragments = snap.datasets.pbmc500(downsample=True)
>>> data = snap.pp.import_fragments(
...     fragments,
...     chrom_sizes=snap.genome.hg38,
...     sorted_by_barcode=False,
... )
>>> snap.pp.add_tile_matrix(data, bin_size=500)
>>> snap.pp.select_features(data, n_features=250000)
>>> print(data.var["selected"].sum())