scirpy.tl.ir_query
- scirpy.tl.ir_query(adata, reference, *, sequence='aa', metric='identity', receptor_arms='all', dual_ir='any', same_v_gene=False, match_columns=None, key_added=None, distance_key=None, inplace=True, n_jobs=None, chunksize=2000)
Query a reference database for matching immune cell receptors.
Warning
This is an experimental function that may change in the future.
The reference database can either be an immune cell receptor database, or simply another scRNA-seq dataset with some annotations in .obs. This function maps all cells to all matching entries from the reference.
Requires running ir_dist() with the same values for reference, sequence and metric first.
This function is essentially an extension of define_clonotype_clusters() to two AnnData objects and follows the same logic:
Definition of clonotype(-clusters) follows roughly the following procedure:
- Create a list of unique receptor configurations. This collapses heavily expanded clonotypes, in which many cells share identical CDR3 sequences, into a single entry.
- Compute a pairwise distance matrix of unique receptor configurations. Unique receptor configurations are matched based on the pre-computed VJ and VDJ distance matrices and the parameters of receptor_arms, dual_ir, same_v_gene and within_group.
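A minimal usage sketch of the two-step workflow described above, assuming `adata` and `reference` are AnnData objects that already carry IR information. The helper names `default_keys` and `query_reference` and the `name` argument are hypothetical and only for illustration. The keys follow the naming pattern documented for `key_added` and `distance_key` and are passed explicitly, so the sketch does not rely on `reference.uns["DB"]["name"]` being set. `scirpy` is imported inside the function so the helpers can be defined without it installed.

```python
def default_keys(name, sequence="aa", metric="identity"):
    """Build key names following the documented default pattern."""
    # Anything that is not a string (e.g. a DistanceCalculator instance)
    # is represented as "custom" in the key name.
    metric_name = metric if isinstance(metric, str) else "custom"
    distance_key = f"ir_dist_{name}_{sequence}_{metric_name}"
    query_key = f"ir_query_{name}_{sequence}_{metric_name}"
    return distance_key, query_key


def query_reference(adata, reference, name, sequence="aa", metric="identity"):
    """Run ir_dist and ir_query with matching settings (illustrative helper)."""
    import scirpy as ir  # lazy import: only needed when the helper is called

    distance_key, query_key = default_keys(name, sequence, metric)
    # Step 1: compute sequence distances between query and reference.
    ir.pp.ir_dist(
        adata, reference, sequence=sequence, metric=metric, key_added=distance_key
    )
    # Step 2: match cells to reference entries, reusing the same settings.
    ir.tl.ir_query(
        adata,
        reference,
        sequence=sequence,
        metric=metric,
        distance_key=distance_key,
        key_added=query_key,
    )
    return adata.uns[query_key]
```

Passing the same `sequence` and `metric` to both calls is the important part; with mismatched values the query may not find the pre-computed distances.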
- Parameters
- adata :
AnnData
Annotated data matrix.
- reference :
AnnData
Another AnnData object; can be either a second dataset with IR information or an epitope database. Must be the same object used for running scirpy.pp.ir_dist().
- sequence :
Literal['aa', 'nt'] (default: 'aa')
The sequence parameter used when running scirpy.pp.ir_dist().
- metric :
Union[Literal['alignment', 'identity', 'levenshtein', 'hamming'], DistanceCalculator] (default: 'identity')
The metric parameter used when running scirpy.pp.ir_dist().
- receptor_arms :
Literal['VJ', 'VDJ', 'all', 'any'] (default: 'all')
One of the following options:
If "any", two distances are combined by taking their minimum. If "all", two distances are combined by taking their maximum. This is motivated by the hypothesis that a receptor recognizes the same antigen if it has a distance smaller than a certain cutoff. If we require only one of the receptors to match ("any"), the smaller distance is relevant. If we require both receptors to match ("all"), the larger distance is relevant.
- dual_ir :
Literal['any', 'primary_only', 'all'] (default: 'any')
One of the following options:
Distances are combined as for receptor_arms.
See also Dual IR.
- same_v_gene :
bool (default: False)
Enforces clonotypes to have the same V-genes. This is useful as the CDR1 and CDR2 regions are fully encoded in this gene. See CDR for more details.
V genes are matched based on the behaviour defined with receptor_arms and dual_ir.
- match_columns :
Union[Sequence[str], str, None] (default: None)
One or multiple columns in adata.obs that must match between query and reference. Use this to e.g. enforce matching cell-types or HLA-types.
- key_added :
Optional[str] (default: None)
Dictionary key under which the resulting distance matrix will be stored in adata.uns if inplace=True. Defaults to ir_query_{name}_{sequence}_{metric}. If metric is an instance of scirpy.ir_dist.metrics.DistanceCalculator, {metric} defaults to custom. {name} is taken from reference.uns["DB"]["name"]. If reference does not have a "DB" entry, key_added needs to be specified manually.
- distance_key :
Optional[str] (default: None)
Key in adata.uns where the results of ir_dist() are stored. Defaults to ir_dist_{name}_{sequence}_{metric}. If metric is an instance of scirpy.ir_dist.metrics.DistanceCalculator, {metric} defaults to custom. {name} is taken from reference.uns["DB"]["name"]. If reference does not have a "DB" entry, distance_key needs to be specified manually.
- inplace :
bool (default: True)
If True, store the result in adata.uns. Otherwise return a dictionary with the results.
- n_jobs :
Optional[int] (default: None)
Number of CPUs to use for clonotype cluster calculation. Default: use all cores. If the number of cells is smaller than 2 * chunksize, a single worker thread will be used to avoid overhead.
- chunksize :
int (default: 2000)
Number of objects to process per chunk. Each worker thread receives data in chunks. Smaller chunks result in a more meaningful progress bar, but more overhead.
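The min/max rule described for `receptor_arms` (and reused by `dual_ir`) can be illustrated with a small sketch. It assumes plain Python lists of VJ and VDJ distances, one value per receptor pair; the function names are hypothetical, and the sketch does not model how scirpy stores or encodes its distance matrices internally. Only the `"any"` and `"all"` modes described above are covered.

```python
def combine_distances(d_vj, d_vdj, mode="all"):
    """Combine per-pair VJ and VDJ distances for the "any" / "all" modes."""
    if len(d_vj) != len(d_vdj):
        raise ValueError("d_vj and d_vdj must have the same length")
    if mode == "any":
        # Only one arm has to match, so the smaller distance decides.
        return [min(a, b) for a, b in zip(d_vj, d_vdj)]
    if mode == "all":
        # Both arms have to match, so the larger distance decides.
        return [max(a, b) for a, b in zip(d_vj, d_vdj)]
    raise ValueError(f"mode {mode!r} is not covered by this sketch")


def is_match(d_vj, d_vdj, cutoff, mode="all"):
    """Flag pairs whose combined distance does not exceed a cutoff."""
    return [d <= cutoff for d in combine_distances(d_vj, d_vdj, mode)]
```

For example, with VJ distances `[1, 5]`, VDJ distances `[3, 2]` and a cutoff of 2, both pairs match under `"any"` (combined distances 1 and 2), while neither matches under `"all"` (combined distances 3 and 5).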
- Return type
- Returns
A dictionary containing
- distances: A sparse distance matrix between unique receptor configurations in adata and unique receptor configurations in reference.
- cell_indices: A dict of arrays, containing the adata.obs_names (cell indices) for each row in the distance matrix.
- cell_indices_reference: A dict of arrays, containing the reference.obs_names for each column in the distance matrix.
If inplace is True, this is added to adata.uns[key_added].
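To show how the three entries relate, here is a small sketch that expands matches between unique receptor configurations back to individual cells. It is a toy model under stated assumptions: the sparse matrix is stood in for by a plain dict keyed by `(row, column)`, the two index dicts are assumed to be keyed by row or column number, and `expand_matches` is a hypothetical helper, not part of scirpy.

```python
def expand_matches(result):
    """List (query_cell, reference_cell, distance) for every stored entry."""
    pairs = []
    for (row, col), dist in sorted(result["distances"].items()):
        # Each row stands for one unique configuration shared by several query cells.
        for query_cell in result["cell_indices"][row]:
            # Each column stands for one unique configuration in the reference.
            for ref_cell in result["cell_indices_reference"][col]:
                pairs.append((query_cell, ref_cell, dist))
    return pairs


# Toy data: two unique configurations on each side, two stored distances.
toy_result = {
    "distances": {(0, 1): 1, (1, 0): 3},
    "cell_indices": {0: ["cell_a", "cell_b"], 1: ["cell_c"]},
    "cell_indices_reference": {0: ["ref_1"], 1: ["ref_2", "ref_3"]},
}

matches = expand_matches(toy_result)
```

Because expanded clonotypes are collapsed to one row, a single stored distance can correspond to many cell-level pairs; in the toy data, two stored entries expand to five pairs.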