snapatac2.pp.select_features
- snapatac2.pp.select_features(adata, n_features=500000, filter_lower_quantile=0.005, filter_upper_quantile=0.005, whitelist=None, blacklist=None, max_iter=1, inplace=True, n_jobs=8, verbose=True)
Select informative genomic features for downstream analysis.
Use this function after generating a tile, peak, or other count matrix to mark features that should be used for dimensionality reduction and graph construction. With the default max_iter=1, features are selected by total accessibility across all cells after lower- and upper-quantile filtering.
Notes
This function does not subset the matrix. It stores a boolean mask in .var["selected"] when inplace=True, or returns the mask when inplace=False. Downstream functions use this mask to generate submatrices on the fly. Features that are zero in all cells are always removed. For more discussion, see scverse/SnapATAC2#116.
How to set n_features: This value depends on the number of features in the input matrix. It is generally recommended to set n_features to a large value (10% to 50% of the total features) to retain enough features for downstream analysis.
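The count-based behaviour described in the notes can be sketched with NumPy. This is a simplified illustration, not SnapATAC2's implementation: the helper `count_based_mask` and the toy matrix are hypothetical, and the sketch assumes the quantiles are taken over features with nonzero totals. It shows that the result is a boolean mask over features and that the matrix itself keeps its shape.

```python
import numpy as np


def count_based_mask(counts, n_features, lower_q=0.005, upper_q=0.005):
    """Hypothetical helper: boolean mask over the columns of a cells x features matrix."""
    totals = np.asarray(counts).sum(axis=0)
    # Features with zero total count are never candidates.
    nonzero = np.flatnonzero(totals > 0)
    # Rank the nonzero features from lowest to highest total count.
    order = nonzero[np.argsort(totals[nonzero], kind="stable")]
    # Assumption: quantiles are applied to the nonzero features only.
    n_low = int(lower_q * order.size)
    n_high = int(upper_q * order.size)
    candidates = order[n_low:order.size - n_high]
    # Keep at most n_features candidates, highest totals first.
    keep = candidates[::-1][:n_features]
    mask = np.zeros(totals.size, dtype=bool)
    mask[keep] = True
    return mask


# Toy matrix: 1 cell x 11 features (feature 0 is all zero, feature 10 is an outlier).
counts = np.array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 100]])
n_features = int(0.3 * counts.shape[1])  # 30% of the features, within the 10%-50% guidance
mask = count_based_mask(counts, n_features, lower_q=0.1, upper_q=0.1)
print(mask.sum(), counts.shape)  # the mask marks features; counts is not subset
```

In this toy case the all-zero feature and the two tails are excluded before the top features by total count are marked.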
Anti-Patterns
- Do NOT expect this function to reduce adata.shape; it only creates or returns a feature mask.
- Do NOT set max_iter > 1 unless iterative clustering-based feature selection is explicitly required; this mode is slower and is not generally recommended.
- Do NOT use very large filter_upper_quantile values on datasets with many features unless highly accessible features should be removed aggressively.
- Do NOT assume blacklist overrides whitelist. If a feature appears in both, the whitelist keeps it.
- Do NOT set n_features too small; the spectral embedding used in this package usually benefits from a large number of features.
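The whitelist and blacklist precedence can be illustrated with plain boolean arrays. This is a sketch of the documented rules only, not the library's code: `combine_masks` and all four input arrays are hypothetical stand-ins for "passed count-based selection", "nonzero in at least one cell", "overlaps a whitelist region", and "overlaps a blacklist region".

```python
import numpy as np


def combine_masks(selected, nonzero, in_whitelist, in_blacklist):
    """Hypothetical helper applying the documented precedence rules."""
    selected = np.asarray(selected, dtype=bool)
    nonzero = np.asarray(nonzero, dtype=bool)
    in_whitelist = np.asarray(in_whitelist, dtype=bool)
    in_blacklist = np.asarray(in_blacklist, dtype=bool)
    keep = selected & ~in_blacklist  # the blacklist removes features ...
    keep = keep | in_whitelist       # ... unless the whitelist keeps them
    return keep & nonzero            # all-zero features are always removed


selected     = [True,  True,  False, False, True]
nonzero      = [True,  True,  True,  False, True]
in_whitelist = [False, True,  True,  True,  False]
in_blacklist = [True,  True,  False, False, False]

result = combine_masks(selected, nonzero, in_whitelist, in_blacklist)
print(result.tolist())
```

Feature 1 is in both lists and is kept, while feature 3 is whitelisted but zero in all cells and is removed.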
- type adata:
AnnData | AnnDataSet | list[AnnData]
- param adata:
AnnData, AnnDataSet, or list of AnnData objects containing a count matrix in .X. If a list is provided, feature selection is applied to each object in parallel.
- type n_features:
- param n_features:
Maximum number of features to keep. The final number can be smaller if too few features pass filtering or have nonzero counts.
- type filter_lower_quantile:
- param filter_lower_quantile:
Lower quantile of the feature-count distribution to remove. For example, 0.005 removes the bottom 0.5% of features by total count.
- type filter_upper_quantile:
- param filter_upper_quantile:
Upper quantile of the feature-count distribution to remove. For example, 0.005 removes the top 0.5% of features by total count. When the number of features is very large, this value can remove many features: 0.5% of 6,000,000 features is roughly 30,000 features.
- type whitelist:
- param whitelist:
BED file containing regions to keep. Nonzero features overlapping these regions are kept regardless of other filtering criteria. If a feature is present in both whitelist and blacklist, it is kept.
- type blacklist:
- param blacklist:
BED file containing regions to remove. Features overlapping these regions are removed unless they are also kept by whitelist.
- type max_iter:
- param max_iter:
Number of feature-selection iterations. Use 1 for count-based feature selection. Values greater than 1 perform iterative clustering and feature selection based on variable features found from previous clustering results. This is similar to ArchR but is not generally recommended; see scverse/SnapATAC2#111.
- type inplace:
- param inplace:
If True, store the boolean mask in adata.var["selected"] and return None. If False, return the mask without modifying adata.
- type n_jobs:
- param n_jobs:
Number of parallel jobs to use when adata is a list.
- type verbose:
- param verbose:
Whether to print progress messages.
- returns:
If inplace=False, returns a boolean feature mask where True means the feature is kept and False means the feature is removed. If adata is a list, returns one mask per object. If inplace=True, returns None and stores the mask in .var["selected"].
- rtype:
Examples
>>> import snapatac2 as snap
>>> fragments = snap.datasets.pbmc500(downsample=True)
>>> data = snap.pp.import_fragments(
...     fragments,
...     chrom_sizes=snap.genome.hg38,
...     sorted_by_barcode=False,
... )
>>> snap.pp.add_tile_matrix(data, bin_size=500)
>>> snap.pp.select_features(data, n_features=250000)
>>> print(data.var["selected"].sum())
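The final line of the example counts the selected features. To show what a boolean feature mask of this kind allows, the sketch below uses a small NumPy array as a stand-in for a count matrix. The names `counts`, `selected`, and `submatrix` are hypothetical, and this is not how SnapATAC2 builds submatrices internally.

```python
import numpy as np

# Toy cells x features count matrix (3 cells, 4 features).
counts = np.array([
    [0, 2, 0, 1],
    [0, 0, 3, 1],
    [0, 1, 1, 0],
])

# Stand-in for a boolean feature mask: True means the feature is kept.
selected = np.array([False, True, True, False])

# Index the columns with the mask; the original matrix is left untouched.
submatrix = counts[:, selected]
print(counts.shape, submatrix.shape)
```

Because only the mask is stored, the selection can be changed later without re-importing or copying the full matrix.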