scirpy.pp.ir_dist

scirpy.pp.ir_dist(adata, reference=None, *, metric='identity', cutoff=None, sequence='nt', key_added=None, inplace=True, n_jobs=None, airr_mod='airr', airr_key='airr', chain_idx_key='chain_indices', airr_mod_ref='airr', airr_key_ref='airr', chain_idx_key_ref='chain_indices')

Computes a sequence-distance metric between all unique VJ CDR3 sequences and between all unique VDJ CDR3 sequences.

This is a required preprocessing step for clonotype definition, for clonotype networks, and for querying reference databases.

Calculates the full pairwise distance matrix.

Important

  • Distances are offset by 1 to allow efficient use of sparse matrices (\(d' = d+1\)).

  • That is, a distance > cutoff is represented as 0, a distance of 0 is represented as 1, a distance of 1 is represented as 2, and so on.

  • Only returns distances <= cutoff. Larger distances are eliminated from the sparse matrix.

  • Distances are non-negative.
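
The offset convention described in the notes above can be illustrated with a minimal pure-Python sketch. The helper names `encode_distances` and `decode_distance` are hypothetical and not part of scirpy, and a dict keyed by `(row, col)` stands in for a sparse matrix; the library's actual storage may differ.

```python
def encode_distances(dense, cutoff):
    """Store d + 1 for every distance d <= cutoff; omit everything else."""
    sparse = {}
    for i, row in enumerate(dense):
        for j, d in enumerate(row):
            if d <= cutoff:
                sparse[(i, j)] = d + 1  # offset by 1 so a true 0 is not lost
    return sparse


def decode_distance(sparse, i, j):
    """Return the true distance, or None if it exceeded the cutoff."""
    stored = sparse.get((i, j), 0)  # a missing entry is an implicit 0
    return None if stored == 0 else stored - 1


# Toy symmetric distance matrix (illustrative values only).
dense = [
    [0, 1, 5],
    [1, 0, 3],
    [5, 3, 0],
]
encoded = encode_distances(dense, cutoff=2)
```

With `cutoff=2`, the entries 5 and 3 are dropped, while identical sequences (distance 0) remain as explicit entries with the stored value 1.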

Parameters
adata : AnnData | MuData | DataHandler

AnnData or MuData object that contains AIRR information.

reference : AnnData | MuData | DataHandler | None (default: None)

Another AnnData object: either a second dataset with IR information or an epitope database. If specified, distances are computed between the sequences in adata and the sequences in reference. Otherwise, pairwise distances among the sequences in adata are computed.

metric : {'alignment', 'identity', 'levenshtein', 'hamming'} | DistanceCalculator (default: 'identity')

The distance metric to use: one of 'alignment', 'identity', 'levenshtein', or 'hamming', or a custom DistanceCalculator instance.

cutoff : int | None (default: None)

All distances > cutoff are replaced by 0 and eliminated from the sparse matrix. A sensible cutoff depends on the distance metric; see the documentation of the corresponding metric for guidance. If set to None, the cutoff is 10 for the alignment metric and 2 for levenshtein and hamming. For the identity metric, the cutoff is ignored and always set to 0.
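
The default-cutoff rules stated above can be summarized in a short sketch. The function `resolve_cutoff` is a hypothetical helper written for illustration, not a scirpy function; it covers only the four named metrics and does not model custom DistanceCalculator instances.

```python
def resolve_cutoff(metric, cutoff=None):
    """Return the effective cutoff for a named metric, per the rules above."""
    defaults = {"alignment": 10, "levenshtein": 2, "hamming": 2}
    if metric == "identity":
        # The identity metric ignores any user-supplied cutoff.
        return 0
    if metric not in defaults:
        raise ValueError(f"unknown metric: {metric!r}")
    # An explicit cutoff wins; otherwise fall back to the documented default.
    return defaults[metric] if cutoff is None else cutoff
```

For example, `resolve_cutoff("alignment")` gives 10, while `resolve_cutoff("identity", cutoff=5)` still gives 0.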

sequence : {'aa', 'nt'} (default: 'nt')

Compute distances based on amino acid (aa) or nucleotide (nt) sequences.

key_added : str | None (default: None)

Dictionary key under which the results will be stored in adata.uns if inplace=True. Defaults to ir_dist_{sequence}_{metric} or ir_dist_{name}_{sequence}_{metric} if reference is specified. If metric is an instance of scirpy.ir_dist.metrics.DistanceCalculator, {metric} defaults to custom. {name} is taken from reference.uns["DB"]["name"]. If reference does not have a "DB" entry, key_added needs to be specified manually.
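
The naming scheme for the default key can be sketched as follows. The function `default_key` and the reference name `"vdjdb"` used in the usage note are illustrative assumptions; in the library, {name} is read from reference.uns["DB"]["name"] as described above.

```python
def default_key(sequence, metric, reference_name=None):
    """Build ir_dist_{sequence}_{metric} or ir_dist_{name}_{sequence}_{metric}."""
    # Any metric that is not a plain string (e.g. a calculator object)
    # is reported as "custom" in the key.
    metric_name = metric if isinstance(metric, str) else "custom"
    parts = ["ir_dist"]
    if reference_name is not None:
        parts.append(reference_name)
    parts += [sequence, metric_name]
    return "_".join(parts)
```

Under these assumptions, `default_key("nt", "identity")` yields `"ir_dist_nt_identity"`, and `default_key("aa", "alignment", "vdjdb")` yields `"ir_dist_vdjdb_aa_alignment"`.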

inplace : bool (default: True)

If True, store the result in adata.uns. Otherwise, return a dictionary with the results.

n_jobs : int | None (default: None)

Number of cores to use for distance calculation. Passed on to scirpy.ir_dist.metrics.DistanceCalculator.

airr_mod : str (default: 'airr')

Name of the modality in the MuData object in which the AIRR information is stored. If an AnnData object is passed to the function, this parameter is ignored.

airr_key : str (default: 'airr')

Key under which the AIRR information is stored in adata.obsm as an awkward array.

chain_idx_key : str (default: 'chain_indices')

Key under which the chain indices are stored in adata.obsm. If chain indices are not present, index_chains() is run with default parameters.

airr_mod_ref : str (default: 'airr')

Like airr_mod, but for reference.

airr_key_ref : str (default: 'airr')

Like airr_key, but for reference.

chain_idx_key_ref : str (default: 'chain_indices')

Like chain_idx_key, but for reference.

Return type

dict | None

Returns

Depending on the value of inplace, returns either None or a dictionary with sparse pairwise distance matrices for all VJ and VDJ sequences.
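
A minimal usage sketch, assuming `adata` already contains AIRR information. The wrapper `compute_aa_levenshtein` is a hypothetical helper; its optional `ir_dist` argument exists only so the sketch can be exercised without scirpy installed. The keyword arguments are taken from the signature at the top of this page, and the layout of the returned dictionary is not assumed here.

```python
def compute_aa_levenshtein(adata, ir_dist=None):
    """Call ir_dist with inplace=False and return the resulting dictionary."""
    if ir_dist is None:
        # Imported lazily so the sketch can be read without scirpy installed.
        from scirpy.pp import ir_dist
    return ir_dist(
        adata,
        metric="levenshtein",
        sequence="aa",
        cutoff=2,
        inplace=False,  # return the result instead of writing to adata.uns
    )
```

With the default `inplace=True`, the same call would instead store the result in adata.uns under the key described for key_added and return None.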