scirpy.tl.define_clonotypes
- scirpy.tl.define_clonotypes(adata, *, key_added='clone_id', distance_key=None, **kwargs)
Define clonotypes based on CDR3 nucleic acid sequence identity.
As opposed to define_clonotype_clusters(), which employs a more flexible definition of clonotype clusters, this function stringently defines clonotypes based on nucleic acid sequence identity. Technically, this function is an alias to define_clonotype_clusters() with different default parameters.
The definition of clonotypes (or clonotype clusters) roughly follows this procedure:
1. Create a list of unique receptor configurations. This is useful to collapse heavily expanded clonotypes, leading to many cells with identical CDR3 sequences, to a single entry.
2. Compute a pairwise distance matrix of unique receptor configurations. Unique receptor configurations are matched based on the pre-computed VJ and VDJ distance matrices and the parameters of receptor_arms, dual_ir, same_v_gene and within_group.
3. Find connected modules in the graph defined by this distance matrix. Each connected module is considered a clonotype-cluster.
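The three steps above can be sketched in plain Python. This is a toy illustration and not scirpy's implementation: the function `toy_clonotypes`, the `match` predicate, and the example sequences are all hypothetical, and the pairwise comparison is done naively instead of using pre-computed sparse distance matrices.

```python
from collections import defaultdict


def toy_clonotypes(cells, match):
    """Assign a clonotype label to each cell (illustrative only).

    cells: dict mapping a cell name to a (VJ CDR3, VDJ CDR3) tuple.
    match: predicate deciding whether two configurations are connected.
    """
    # Step 1: collapse cells with identical receptors into unique configurations.
    config_to_cells = defaultdict(list)
    for cell, config in cells.items():
        config_to_cells[config].append(cell)
    configs = list(config_to_cells)

    # Step 2: compare all pairs of unique configurations; matches become graph edges.
    neighbours = {i: set() for i in range(len(configs))}
    for i in range(len(configs)):
        for j in range(i + 1, len(configs)):
            if match(configs[i], configs[j]):
                neighbours[i].add(j)
                neighbours[j].add(i)

    # Step 3: every connected module of the graph becomes one clonotype.
    labels = {}
    next_label = 0
    for start in range(len(configs)):
        if start in labels:
            continue
        stack = [start]
        while stack:
            node = stack.pop()
            if node in labels:
                continue
            labels[node] = next_label
            stack.extend(n for n in neighbours[node] if n not in labels)
        next_label += 1

    # Map the label of each unique configuration back to its cells.
    return {
        cell: labels[i]
        for i, config in enumerate(configs)
        for cell in config_to_cells[config]
    }


# Hypothetical toy data: one (VJ CDR3, VDJ CDR3) nucleotide pair per cell.
cells = {
    "cell1": ("TGTGCC", "TGTGCA"),
    "cell2": ("TGTGCC", "TGTGCA"),
    "cell3": ("TGTGCC", "TGTAAA"),
    "cell4": ("TGTGGG", "TGTAAA"),
    "cell5": ("TGTTTT", "TGTCCC"),
}


def share_an_arm(a, b):
    # Assumed relaxed rule for the demo: connected if either arm is identical.
    return a[0] == b[0] or a[1] == b[1]


clonotypes = toy_clonotypes(cells, share_an_arm)
strict = toy_clonotypes(cells, lambda a, b: a == b)
```

With the relaxed rule, cell1 to cell4 end up in one clonotype even though cell1 and cell4 share no sequence, because they are linked through cell3. With the strict rule, only cells with fully identical configurations (cell1 and cell2) share a label.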
- Parameters
- adata :
AnnData
Annotated data matrix
- receptor_arms
- One of the following options:
If "any", two distances are combined by taking their minimum. If "all", two distances are combined by taking their maximum. This is motivated by the hypothesis that a receptor recognizes the same antigen if it has a distance smaller than a certain cutoff. If we require only one of the receptors to match ("any"), the smaller distance is relevant. If we require both receptors to match ("all"), the larger distance is relevant.
- dual_ir
- One of the following options:
Distances are combined as for receptor_arms.
See also Dual IR.
- same_v_gene
Enforces clonotypes to have the same V-genes. This is useful as the CDR1 and CDR2 regions are fully encoded in this gene. See CDR for more details.
V genes are matched based on the behaviour defined with receptor_arms and dual_ir.
- within_group
Enforces clonotypes to have the same group defined by one or multiple grouping variables. By default, this is set to receptor_type, i.e. clonotypes cannot comprise both B cells and T cells. Set this to receptor_subtype if you don't want clonotypes to be shared across e.g. gamma-delta and alpha-beta T cells. You can also set this to any other column in adata.obs that contains a grouping, or to None if you want no constraints.
- key_added :
str (default: 'clone_id')
The column name under which the clonotype clusters and cluster sizes will be stored in adata.obs and under which the clonotype network will be stored in adata.uns.
- inplace
If True, adds the results to adata; otherwise, returns them.
- n_jobs
Number of CPUs to use for clonotype cluster calculation. Default: use all cores. If the number of cells is smaller than 2 * chunksize, a single worker thread will be used to avoid overhead.
- chunksize
Number of objects to process per chunk. Each worker thread receives data in chunks. Smaller chunks result in a more meaningful progressbar, but more overhead.
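As a minimal sketch of how the "any" and "all" settings of receptor_arms combine two distances, consider the following. The helper names `combine_arm_distances` and `is_match`, the example distances, and the cutoff value are hypothetical and only illustrate the min/max rule described above; they are not part of scirpy.

```python
def combine_arm_distances(d_vj, d_vdj, receptor_arms):
    """Combine the VJ and VDJ distances of two receptors into one value."""
    if receptor_arms == "any":
        # Only one arm needs to match, so the smaller distance is relevant.
        return min(d_vj, d_vdj)
    if receptor_arms == "all":
        # Both arms need to match, so the larger distance is relevant.
        return max(d_vj, d_vdj)
    raise ValueError(f"unsupported value in this sketch: {receptor_arms!r}")


def is_match(d_vj, d_vdj, receptor_arms, cutoff):
    """Two receptors match if the combined distance is below the cutoff."""
    return combine_arm_distances(d_vj, d_vdj, receptor_arms) < cutoff


# Hypothetical pair of receptors: identical VJ arm, dissimilar VDJ arm.
d_vj, d_vdj, cutoff = 0, 12, 10

match_any = is_match(d_vj, d_vdj, "any", cutoff)
match_all = is_match(d_vj, d_vdj, "all", cutoff)
```

Here the pair matches under "any" (combined distance 0) but not under "all" (combined distance 12).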
- Return type
Tuple[Series, Series, dict] | None
- Returns
- clonotype
Series
A Series containing the clonotype id for each cell. Will be stored in adata.obs[key_added] if inplace is True.
- clonotype_size
Series
A Series containing the number of cells in the respective clonotype for each cell. Will be stored in adata.obs[f"{key_added}_size"] if inplace is True.
- distance_result
dict
- A dictionary containing
distances: A sparse, pairwise distance matrix between unique receptor configurations.
cell_indices: A dict of arrays, containing the adata.obs_names (cell indices) for each row in the distance matrix.
If inplace is True, this is added to adata.uns[key_added].
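The relationship between the clonotype and clonotype_size results can be illustrated with a small sketch. Plain dicts stand in for the pandas Series here, and the cell names and clonotype ids are made up; the sketch only mirrors the description above and the column naming via key_added.

```python
from collections import Counter

# Hypothetical per-cell clonotype ids, standing in for the `clonotype` Series.
clonotype = {"cell1": "0", "cell2": "0", "cell3": "1", "cell4": "0"}

# Count how many cells belong to each clonotype.
cells_per_clonotype = Counter(clonotype.values())

# One size value per cell, standing in for the `clonotype_size` Series.
clonotype_size = {cell: cells_per_clonotype[cid] for cell, cid in clonotype.items()}

# Column names as they would be derived from key_added.
key_added = "clone_id"
obs_columns = {key_added: clonotype, f"{key_added}_size": clonotype_size}
```

Every cell carries the size of its own clonotype, so cells of the same clonotype share the same size value.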