scirpy.tl.define_clonotype_clusters
- scirpy.tl.define_clonotype_clusters(adata, *, sequence='aa', metric='identity', receptor_arms='all', dual_ir='any', same_v_gene=False, within_group='receptor_type', key_added=None, partitions='connected', resolution=1, n_iterations=5, distance_key=None, inplace=True, n_jobs=None, chunksize=2000)
Define clonotype clusters.
As opposed to define_clonotypes(), which employs a more stringent definition of clonotypes, this function flexibly defines clonotype clusters based on amino acid or nucleic acid sequence identity or similarity.
Requires running ir_dist() with the same sequence and metric values first.
- The definition of clonotype clusters roughly follows this procedure:
1. Create a list of unique receptor configurations. This collapses heavily expanded clonotypes, in which many cells share identical CDR3 sequences, into a single entry.
2. Compute a pairwise distance matrix of unique receptor configurations. Unique receptor configurations are matched based on the pre-computed VJ and VDJ distance matrices and the parameters receptor_arms, dual_ir, same_v_gene and within_group.
3. Find connected modules in the graph defined by this distance matrix. Each connected module is considered a clonotype cluster.
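The three steps above can be illustrated with a small, self-contained sketch. It is not scirpy's implementation: the Hamming-style toy distance, the `cutoff` argument, and the names `hamming` and `cluster_receptors` are assumptions made for illustration only, and each receptor is reduced to a single CDR3 string.

```python
from itertools import combinations


def hamming(a, b):
    """Toy distance: number of mismatches; infinite for unequal lengths."""
    if len(a) != len(b):
        return float("inf")
    return sum(x != y for x, y in zip(a, b))


def cluster_receptors(cells, cutoff=1):
    """Map each cell id to a cluster id, given a dict of cell id -> CDR3."""
    # Step 1: collapse cells to unique receptor configurations.
    unique = sorted(set(cells.values()))
    index = {seq: i for i, seq in enumerate(unique)}

    # Union-find structure used to track connected modules.
    parent = list(range(len(unique)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Step 2: pairwise distances between unique configurations.
    for i, j in combinations(range(len(unique)), 2):
        if hamming(unique[i], unique[j]) <= cutoff:
            # Step 3: an edge joins the two configurations into one module.
            parent[find(i)] = find(j)

    # Translate module roots into consecutive cluster ids per cell.
    labels = {}
    result = {}
    for cell, seq in cells.items():
        root = find(index[seq])
        result[cell] = labels.setdefault(root, len(labels))
    return result


cells = {
    "cell1": "CASSLGTDTQYF",
    "cell2": "CASSLGTDTQYF",   # identical to cell1 (expanded clonotype)
    "cell3": "CASSLGADTQYF",   # one mismatch relative to cell1
    "cell4": "CAWSVGQGYEQYF",  # different length, treated as unrelated
}
clusters = cluster_receptors(cells, cutoff=1)
print(clusters)
```

With `cutoff=1`, the first three cells end up in one cluster and the fourth in its own; with `cutoff=0`, only the identical sequences are grouped.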
- Parameters
- adata : AnnData
Annotated data matrix.
- sequence : Literal['aa', 'nt'] (default: 'aa')
The sequence parameter used when running scirpy.pp.ir_dist().
- metric : Union[Literal['alignment', 'identity', 'levenshtein', 'hamming'], DistanceCalculator] (default: 'identity')
The metric parameter used when running scirpy.pp.ir_dist().
- receptor_arms : Literal['VJ', 'VDJ', 'all', 'any'] (default: 'all')
One of 'VJ', 'VDJ', 'all' or 'any'.
If "any", two distances are combined by taking their minimum. If "all", two distances are combined by taking their maximum. This is motivated by the hypothesis that a receptor recognizes the same antigen if it has a distance smaller than a certain cutoff. If we require only one of the receptors to match ("any"), the smaller distance is relevant. If we require both receptors to match ("all"), the larger distance is relevant.
- dual_ir : Literal['primary_only', 'all', 'any'] (default: 'any')
One of 'primary_only', 'all' or 'any'. Distances are combined as for receptor_arms.
See also Dual IR.
- same_v_gene : bool (default: False)
Enforces clonotypes to have the same V genes. This is useful as the CDR1 and CDR2 regions are fully encoded in this gene. See CDR for more details.
V genes are matched based on the behaviour defined with receptor_arms and dual_ir.
- within_group
Enforces clonotypes to have the same group defined by one or multiple grouping variables. By default, this is set to receptor_type, i.e. clonotypes cannot comprise both B cells and T cells. Set this to receptor_subtype if you don't want clonotypes to be shared across e.g. gamma-delta and alpha-beta T cells. You can also set this to any other column in adata.obs that contains a grouping, or to None if you want no constraints.
- key_added
The column name under which the clonotype clusters and cluster sizes will be stored in adata.obs and under which the clonotype network will be stored in adata.uns.
Defaults to cc_{sequence}_{metric}, e.g. cc_aa_levenshtein, where cc stands for "clonotype cluster".
The clonotype sizes will be stored in {key_added}_size, e.g. cc_aa_levenshtein_size.
The clonotype x clonotype network will be stored in {key_added}_dist, e.g. cc_aa_levenshtein_dist.
- partitions
How to find graph partitions that define a clonotype. Possible values are leiden, for using the "Leiden" algorithm, and connected, to find connected sub-graphs (connected components).
The difference is that the Leiden algorithm further divides connected sub-graphs into highly-connected modules.
- resolution
The resolution parameter for the Leiden algorithm.
- n_iterations
The n_iterations parameter for the Leiden algorithm.
- distance_key
Key in adata.uns where the sequence distances are stored. This defaults to ir_dist_{sequence}_{metric}.
- inplace
If True, adds the results to anndata, otherwise returns them.
- n_jobs
Number of CPUs to use for clonotype cluster calculation. Default: use all cores. If the number of cells is smaller than 2 * chunksize, a single worker thread will be used to avoid overhead.
- chunksize
Number of objects to process per chunk. Each worker thread receives data in chunks. Smaller chunks result in a more meaningful progress bar, but more overhead.
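The "any"/"all" combination rule described for receptor_arms (and reused for dual_ir) can be sketched in a few lines. This is an illustration only, not scirpy's code: the function names `combine_distances` and `within_cutoff`, the example distances, and the explicit `cutoff` value are hypothetical, and the other option values are deliberately left out.

```python
def combine_distances(d_vj, d_vdj, mode):
    """Combine two per-arm distances following the "any"/"all" rule."""
    if mode == "any":
        # Only one arm needs to match, so the smaller distance is relevant.
        return min(d_vj, d_vdj)
    if mode == "all":
        # Both arms need to match, so the larger distance is relevant.
        return max(d_vj, d_vdj)
    raise ValueError(f"mode {mode!r} is not covered by this sketch")


def within_cutoff(d_vj, d_vdj, mode, cutoff):
    """Return True if the combined distance is smaller than the cutoff."""
    return combine_distances(d_vj, d_vdj, mode) < cutoff


# Hypothetical pair of receptors: similar VJ chains, dissimilar VDJ chains.
d_vj, d_vdj = 2, 12
print(within_cutoff(d_vj, d_vdj, "any", cutoff=10))
print(within_cutoff(d_vj, d_vdj, "all", cutoff=10))
```

In this example the pair is accepted under "any", because the VJ distance alone is below the cutoff, and rejected under "all", because the VDJ distance exceeds it.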
- Return type
Optional[Tuple[Series, Series, dict]]
- Returns
- clonotype : Series
A Series containing the clonotype id for each cell. Will be stored in adata.obs[key_added] if inplace is True.
- clonotype_size : Series
A Series containing the number of cells in the respective clonotype for each cell. Will be stored in adata.obs[f"{key_added}_size"] if inplace is True.
- distance_result : dict
A dictionary containing
distances: A sparse, pairwise distance matrix between unique receptor configurations.
cell_indices: A dict of arrays, containing the adata.obs_names (cell indices) for each row in the distance matrix.
If inplace is True, this is added to adata.uns[key_added].
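A minimal usage sketch, assuming an AnnData object `adata` that already holds immune receptor data loaded with scirpy. The helper names `default_keys` and `run_clonotype_clusters` are hypothetical, and the choice of the levenshtein metric is arbitrary; the key names follow the defaults documented above. scirpy is imported inside the function so that the snippet can be loaded without it.

```python
def default_keys(sequence="aa", metric="identity"):
    """Build the default key names documented for this function."""
    key_added = f"cc_{sequence}_{metric}"
    return {
        "clonotype": key_added,                          # adata.obs column
        "clonotype_size": f"{key_added}_size",           # adata.obs column
        "distance_key": f"ir_dist_{sequence}_{metric}",  # adata.uns key
    }


def run_clonotype_clusters(adata, sequence="aa", metric="levenshtein"):
    """Compute sequence distances, then define clonotype clusters in place."""
    import scirpy as ir  # lazy import: only needed when the function is called

    # ir_dist must be run first, with the same sequence and metric values.
    ir.pp.ir_dist(adata, sequence=sequence, metric=metric)
    ir.tl.define_clonotype_clusters(
        adata,
        sequence=sequence,
        metric=metric,
        receptor_arms="all",
        dual_ir="any",
    )
    keys = default_keys(sequence, metric)
    # With inplace=True (the default), results are written to adata.obs.
    return adata.obs[[keys["clonotype"], keys["clonotype_size"]]]


print(default_keys("aa", "levenshtein"))
```

Calling `run_clonotype_clusters(adata)` would return the per-cell cluster ids and cluster sizes as a two-column DataFrame.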