snapatac2.pp.select_features

snapatac2.pp.select_features(adata, n_features=500000, filter_lower_quantile=0.005, filter_upper_quantile=0.005, whitelist=None, blacklist=None, max_iter=1, inplace=True, n_jobs=8, verbose=True)

Select informative genomic features for downstream analysis.

Use this function after generating a tile, peak, or other count matrix to mark features that should be used for dimensionality reduction and graph construction. With the default max_iter=1, features are selected by total accessibility across all cells after lower- and upper-quantile filtering.
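
The count-based selection described above can be approximated with plain NumPy. The sketch below is not SnapATAC2's implementation: the function name select_by_total_count, the order of the steps, and the way quantile cutoffs are rounded are assumptions made for illustration only.

```python
import numpy as np

def select_by_total_count(counts, n_features, lower_q=0.005, upper_q=0.005):
    """Hypothetical helper: return a boolean mask over the columns of `counts`."""
    totals = np.asarray(counts).sum(axis=0)      # total accessibility per feature
    order = np.argsort(totals, kind="stable")    # feature indices, ascending by count
    order = order[totals[order] > 0]             # drop features that are zero in all cells
    n = order.size
    n_lower = int(lower_q * n)                   # assumed rounding: truncate
    n_upper = int(upper_q * n)
    order = order[n_lower:n - n_upper]           # trim both tails of the distribution
    keep = order[::-1][:n_features]              # most accessible of what remains
    mask = np.zeros(totals.size, dtype=bool)
    mask[keep] = True
    return mask

# Toy cells-by-features matrix; the first 10 features are empty in every cell.
rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(50, 1000))
counts[:, :10] = 0
mask = select_by_total_count(counts, n_features=200, lower_q=0.1, upper_q=0.1)
```

In this toy run the mask has one entry per feature, and the empty features are never selected.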

Notes

This function does not subset the matrix. It stores a boolean mask in .var["selected"] when inplace=True, or returns the mask when inplace=False. Downstream functions use this mask to generate submatrices on the fly. Features that are zero in all cells are always removed. For more discussion, see scverse/SnapATAC2#116.
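
The difference between marking features and subsetting the matrix can be shown with plain NumPy. This is only an analogy: the arrays X and selected are made-up stand-ins for the count matrix and the .var["selected"] mask, and the snippet does not show how SnapATAC2's downstream functions apply the mask internally.

```python
import numpy as np

# Stand-in for a 3-cell by 4-feature count matrix.
X = np.arange(12).reshape(3, 4)

# Stand-in for the boolean mask stored in .var["selected"].
selected = np.array([True, False, True, False])

# A consumer can build the submatrix on demand; X itself is left unchanged.
sub = X[:, selected]
```

Here X keeps all four feature columns, while sub contains only the two marked ones.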
How to set n_features: This value depends on the number of features in the input matrix. It is generally recommended to set n_features to a large value (10% to 50% of the total features) to retain enough features for downstream analysis.
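
One way to turn the 10% to 50% guideline into a concrete value is to derive n_features from the feature count of the input matrix. The helper below, suggest_n_features, is hypothetical and not part of SnapATAC2; the total of 1,000,000 features is an illustrative number, and in practice it would come from the matrix itself (for example, its number of columns).

```python
def suggest_n_features(n_total, fraction=0.25):
    """Hypothetical helper: pick n_features as a fraction of all features."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return max(1, round(n_total * fraction))

n_total = 1_000_000                        # illustrative feature count
low = suggest_n_features(n_total, 0.10)    # lower end of the guideline
high = suggest_n_features(n_total, 0.50)   # upper end of the guideline
```

Any value between low and high would then be passed as n_features.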

Anti-Patterns

Do NOT expect this function to reduce adata.shape; it only creates or returns a feature mask.
Do NOT set max_iter > 1 unless iterative clustering-based feature selection is explicitly required; this mode is slower and is not generally recommended.
Do NOT use very large filter_upper_quantile values on datasets with many features unless highly accessible features should be removed aggressively.
Do NOT assume blacklist overrides whitelist. If a feature appears in both, the whitelist keeps it.
Do NOT set n_features too small; the spectral embedding used in this package usually benefits from a large number of features.
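
The whitelist and blacklist precedence described above can be written as boolean logic over per-feature flags. This is a sketch of the documented rules, not the library's code: the array names are invented for illustration, the overlap flags are assumed to have been computed already, and the n_features cap is ignored.

```python
import numpy as np

# One entry per feature; all values are made up for illustration.
nonzero       = np.array([True,  True,  True,  True,  False])
passes_filter = np.array([True,  True,  False, False, True])   # survives quantile filtering
in_whitelist  = np.array([False, True,  True,  False, True])
in_blacklist  = np.array([True,  True,  False, False, False])

# Whitelisted features are kept regardless of other criteria,
# blacklisted ones are dropped unless whitelisted,
# and features that are zero in all cells are always removed.
keep = nonzero & (in_whitelist | (passes_filter & ~in_blacklist))
```

Feature 1 sits in both lists and is kept, while feature 4 is whitelisted but still dropped because it is zero in all cells.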

Parameters

adata : AnnData | AnnDataSet | list[AnnData]
    AnnData, AnnDataSet, or list of AnnData objects containing a count matrix in .X. If a list is provided, feature selection is applied to each object in parallel.
n_features : int
    Maximum number of features to keep. The final number can be smaller if too few features pass filtering or have nonzero counts.
filter_lower_quantile : float
    Lower quantile of the feature-count distribution to remove. For example, 0.005 removes the bottom 0.5% of features by total count.
filter_upper_quantile : float
    Upper quantile of the feature-count distribution to remove. For example, 0.005 removes the top 0.5% of features by total count. When the number of features is very large, this value can remove many features.
whitelist : Path | None
    BED file containing regions to keep. Nonzero features overlapping these regions are kept regardless of other filtering criteria. If a feature is present in both whitelist and blacklist, it is kept.
blacklist : Path | None
    BED file containing regions to remove. Features overlapping these regions are removed unless they are also kept by whitelist.
max_iter : int
    Number of feature-selection iterations. Use 1 for count-based feature selection. Values greater than 1 perform iterative clustering and feature selection based on variable features found from previous clustering results. This is similar to ArchR but is not generally recommended; see scverse/SnapATAC2#111.
inplace : bool
    If True, store the boolean mask in adata.var["selected"] and return None. If False, return the mask without modifying adata.
n_jobs : int
    Number of parallel jobs to use when adata is a list.
verbose : bool
    Whether to print progress messages.

Returns

ndarray | list[ndarray] | None
    If inplace=False, returns a boolean feature mask where True means the feature is kept and False means the feature is removed. If adata is a list, returns one mask per object. If inplace=True, returns None and stores the mask in .var["selected"].

Examples

>>> import snapatac2 as snap
>>> fragments = snap.datasets.pbmc500(downsample=True)
>>> data = snap.pp.import_fragments(
...     fragments,
...     chrom_sizes=snap.genome.hg38,
...     sorted_by_barcode=False,
... )
>>> snap.pp.add_tile_matrix(data, bin_size=500)
>>> snap.pp.select_features(data, n_features=250000)
>>> print(data.var["selected"].sum())