scirpy.tl.ir_query
- scirpy.tl.ir_query(adata, reference, *, sequence='aa', metric='identity', receptor_arms='all', dual_ir='any', same_v_gene=False, match_columns=None, key_added=None, distance_key=None, inplace=True, n_jobs=None, chunksize=2000)
Query a reference database for matching immune cell receptors.
Warning
This is an experimental function that may change in the future.
The reference database can either be an immune cell receptor database, or simply another scRNA-seq dataset with some annotations in .obs. This function maps all cells to all matching entries from the reference.

Requires running ir_dist() with the same values for reference, sequence and metric first.

This function is essentially an extension of define_clonotype_clusters() to two AnnData objects and follows the same logic. Definition of clonotype(-clusters) follows roughly the following procedure:
- Create a list of unique receptor configurations. This is useful to collapse heavily expanded clonotypes, which lead to many cells with identical CDR3 sequences, to a single entry.
- Compute a pairwise distance matrix of unique receptor configurations. Unique receptor configurations are matched based on the pre-computed VJ and VDJ distance matrices and the parameters of receptor_arms, dual_ir, same_v_gene and within_group.
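The two steps above can be illustrated with a small, self-contained sketch. It is a toy model and not scirpy's implementation: the helper names (unique_configurations, identity_distance, match_configurations), the example sequences, and the 0/1 encoding of the identity metric are all assumptions made for illustration.

```python
from collections import defaultdict


def unique_configurations(cells):
    """Collapse cells with identical receptor configurations.

    `cells` maps a cell name to a (VJ CDR3, VDJ CDR3) tuple. Returns the
    sorted unique configurations and, for each one, the list of its cells.
    """
    groups = defaultdict(list)
    for cell, config in cells.items():
        groups[config].append(cell)
    configs = sorted(groups)
    return configs, [groups[c] for c in configs]


def identity_distance(a, b):
    # Toy identity metric (assumption): 0 for identical sequences, else 1.
    return 0 if a == b else 1


def match_configurations(query_configs, ref_configs):
    """Return (row, col) index pairs where both the VJ and VDJ arms match."""
    matches = []
    for i, (q_vj, q_vdj) in enumerate(query_configs):
        for j, (r_vj, r_vdj) in enumerate(ref_configs):
            # Requiring both arms to match means the larger distance counts.
            d = max(identity_distance(q_vj, r_vj), identity_distance(q_vdj, r_vdj))
            if d == 0:
                matches.append((i, j))
    return matches


# Hypothetical query and reference data.
query = {
    "cell1": ("CAVRDF", "CASSLGF"),
    "cell2": ("CAVRDF", "CASSLGF"),  # same configuration as cell1
    "cell3": ("CALSEF", "CASSPGF"),
}
reference = {
    "ref1": ("CAVRDF", "CASSLGF"),
    "ref2": ("CAGGGF", "CASTTTF"),
}

q_configs, q_cells = unique_configurations(query)
r_configs, r_cells = unique_configurations(reference)
matches = match_configurations(q_configs, r_configs)
```

Collapsing first keeps the pairwise comparison small: the three query cells reduce to two unique configurations, so only a 2 x 2 comparison is needed rather than one per cell.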
- Parameters
- adata :
AnnData
Annotated data matrix.
- reference :
AnnData
Another AnnData object, either a second dataset with IR information or an epitope database. Must be the same object used for running scirpy.pp.ir_dist().
- sequence :
{'aa', 'nt'} (default: 'aa')
The sequence parameter used when running scirpy.pp.ir_dist().
- metric :
{'alignment', 'identity', 'levenshtein', 'hamming'} | DistanceCalculator (default: 'identity')
The metric parameter used when running scirpy.pp.ir_dist().
- receptor_arms :
{'VJ', 'VDJ', 'all', 'any'} (default: 'all')
One of the listed options. If "any", two distances are combined by taking their minimum. If "all", two distances are combined by taking their maximum. This is motivated by the hypothesis that a receptor recognizes the same antigen if it has a distance smaller than a certain cutoff. If we require only one of the receptors to match ("any"), the smaller distance is relevant. If we require both receptors to match ("all"), the larger distance is relevant.
- dual_ir :
{'any', 'primary_only', 'all'} (default: 'any')
One of the listed options. Distances are combined as for receptor_arms. See also Dual IR.
- same_v_gene :
bool (default: False)
Enforces clonotypes to have the same V genes. This is useful as the CDR1 and CDR2 regions are fully encoded in this gene. See CDR for more details. V genes are matched based on the behaviour defined with receptor_arms and dual_ir.
- match_columns :
Sequence[str] | str | None (default: None)
One or multiple columns in adata.obs that must match between query and reference. Use this, for example, to enforce matching cell types or HLA types.
- key_added :
str | None (default: None)
Dictionary key under which the resulting distance matrix will be stored in adata.uns if inplace=True. Defaults to ir_query_{name}_{sequence}_{metric}. If metric is an instance of scirpy.ir_dist.metrics.DistanceCalculator, {metric} defaults to custom. {name} is taken from reference.uns["DB"]["name"]. If reference does not have a "DB" entry, key_added needs to be specified manually.
- distance_key :
str | None (default: None)
Key in adata.uns where the results of ir_dist() are stored. Defaults to ir_dist_{name}_{sequence}_{metric}. If metric is an instance of scirpy.ir_dist.metrics.DistanceCalculator, {metric} defaults to custom. {name} is taken from reference.uns["DB"]["name"]. If reference does not have a "DB" entry, distance_key needs to be specified manually.
- inplace :
bool (default: True)
If True, store the result in adata.uns. Otherwise return a dictionary with the results.
- n_jobs :
int | None (default: None)
Number of CPUs to use for clonotype cluster calculation. Default: use all cores. If the number of cells is smaller than 2 * chunksize, a single worker thread will be used to avoid overhead.
- chunksize :
int (default: 2000)
Number of objects to process per chunk. Each worker thread receives data in chunks. Smaller chunks result in a more meaningful progress bar, but more overhead.
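The default naming rule for key_added and distance_key can be sketched as a small helper. The function default_key below is hypothetical and is not part of scirpy; it only restates the rule described for these two parameters. The database name "VDJDB" in the usage lines is an example value.

```python
def default_key(prefix, reference_uns, sequence, metric, key=None):
    """Resolve a key of the form {prefix}_{name}_{sequence}_{metric}.

    `prefix` would be "ir_query" for key_added and "ir_dist" for
    distance_key. `reference_uns` stands in for reference.uns.
    """
    if key is not None:
        # An explicitly given key always wins.
        return key
    try:
        name = reference_uns["DB"]["name"]
    except KeyError:
        raise ValueError(
            'reference.uns["DB"]["name"] is missing; specify the key manually.'
        ) from None
    # Assumption: any non-string metric (e.g. a DistanceCalculator
    # instance) is represented as "custom".
    metric_name = metric if isinstance(metric, str) else "custom"
    return f"{prefix}_{name}_{sequence}_{metric_name}"


query_key = default_key("ir_query", {"DB": {"name": "VDJDB"}}, "aa", "identity")
dist_key = default_key("ir_dist", {"DB": {"name": "VDJDB"}}, "aa", object())
```

Under this rule, a reference built from a plain dataset without a "DB" entry has no name to fall back on, which is why both keys then have to be passed explicitly.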
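How receptor_arms combines the VJ and VDJ distances can be shown with a minimal sketch. The functions combine_arm_distances and is_match are hypothetical helpers, not scirpy API. Two points are assumptions: that "VJ" and "VDJ" mean only that arm is considered, and that the cutoff is inclusive.

```python
def combine_arm_distances(d_vj, d_vdj, receptor_arms="all"):
    """Combine the VJ and VDJ distances of two receptor configurations."""
    if receptor_arms == "any":
        # Only one arm needs to match, so the smaller distance is relevant.
        return min(d_vj, d_vdj)
    if receptor_arms == "all":
        # Both arms need to match, so the larger distance is relevant.
        return max(d_vj, d_vdj)
    if receptor_arms == "VJ":
        # Assumption: only the VJ arm is considered.
        return d_vj
    if receptor_arms == "VDJ":
        # Assumption: only the VDJ arm is considered.
        return d_vdj
    raise ValueError(f"Unknown receptor_arms value: {receptor_arms!r}")


def is_match(d_vj, d_vdj, cutoff, receptor_arms="all"):
    # Assumption: a combined distance equal to the cutoff still counts.
    return combine_arm_distances(d_vj, d_vdj, receptor_arms) <= cutoff


# A close VJ arm (distance 2) and a distant VDJ arm (distance 10).
match_any = is_match(2, 10, cutoff=5, receptor_arms="any")
match_all = is_match(2, 10, cutoff=5, receptor_arms="all")
```

With a cutoff of 5, the pair matches under "any" because the VJ arm alone is close enough, but not under "all" because the VDJ arm exceeds the cutoff.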
- Return type
- Returns
- A dictionary containing
  - distances: A sparse distance matrix between unique receptor configurations in adata and unique receptor configurations in reference.
  - cell_indices: A dict of arrays, containing the adata.obs_names (cell indices) for each row in the distance matrix.
  - cell_indices_reference: A dict of arrays, containing the reference.obs_names for each column in the distance matrix.
If inplace is True, this is added to adata.uns[key_added].
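To make the shape of the result concrete, the following sketch expands a result dictionary into pairs of query and reference cells. It is a toy stand-in and does not call scirpy: the sparse matrix is replaced by a plain dict keyed by (row, column), the two index dicts are keyed by integer position, and all cell names and values are invented. The key types of the real dictionaries are not specified here, so treat them as an assumption.

```python
# Hypothetical result with the three documented entries.
result = {
    # Stand-in for the sparse matrix: only stored (row, col) entries.
    "distances": {(0, 1): 1, (1, 0): 1},
    # Row index -> query cells sharing that receptor configuration.
    "cell_indices": {0: ["cell3"], 1: ["cell1", "cell2"]},
    # Column index -> reference entries sharing that configuration.
    "cell_indices_reference": {0: ["ref1"], 1: ["ref2"]},
}


def expand_matches(result):
    """Expand matching configurations into (query cell, reference cell) pairs."""
    pairs = []
    for row, col in sorted(result["distances"]):
        for query_cell in result["cell_indices"][row]:
            for ref_cell in result["cell_indices_reference"][col]:
                pairs.append((query_cell, ref_cell))
    return pairs


pairs = expand_matches(result)
```

Each stored matrix entry links one unique query configuration to one unique reference configuration, so every cell behind the row is paired with every entry behind the column.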