Data structure and usage principles
Using scirpy
Import scirpy as
import scanpy as sc
import scirpy as ir
As a scverse core package, scirpy adheres to the workflow principles laid out by scanpy.
We decided to handle a few minor points differently from scanpy:
Plotting functions with inexpensive computations (e.g. scirpy.pl.clonal_expansion()) call the corresponding tool (scirpy.tl.clonal_expansion()) on-the-fly and don’t store the results in the AnnData object.
All plotting functions, by default, return an Axes object, or a list of such.
Storing AIRR rearrangement data in AnnData
For instructions on how to load data into scirpy, see Loading adaptive immune receptor repertoire (AIRR)-sequencing data with Scirpy.
Note
Scirpy’s data structure was fundamentally changed in version 0.13.0. Previously, immune receptor
data was expanded into columns in adata.obs; it is now stored as an awkward array in adata.obsm.
For more details … # TODO
Scirpy combines the AIRR Rearrangement standard for representing adaptive immune receptor repertoire data with scverse’s AnnData data structure.
AnnData combines a gene expression matrix (.X), gene-level annotations (.var) and
cell-level annotations (.obs) into a single object. Additionally, matrices aligned to the cells can be stored in .obsm.
The AIRR rearrangement standard defines a set of fields to describe single receptor chains. One cell can have
multiple receptor chains. This relationship is represented as an awkward array stored in adata.obsm["airr"].
The first dimension of the array represents the cells and is aligned to the obs axis of the AnnData object.
The second dimension represents the number of chains per cell and is of variable length. The third dimension
is an awkward RecordType and represents the fields defined in the rearrangement standard.
# adata.obsm["airr"]
[
# cell0: 2 chains
[{"locus": "TRA", "junction_aa": "CADASGT..."}, {"locus": "TRB", "junction_aa": "CTFDD..."}],
# cell1: 1 chain
[{"locus": "IGH", "junction_aa": "CDGFFA..."}],
# cell2: 0 chains
[],
]
This makes it possible to store a complete AIRR rearrangement table in AnnData without loss. The purpose of scirpy’s IO module
is to create AnnData objects with the corresponding obsm entries. At this point, chains are neither filtered nor separated by locus.
This allows any scverse ecosystem package working with AIRR data to adopt the data structure and to reuse scirpy’s IO functions,
whether or not it uses scirpy’s receptor model.
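To make the three dimensions concrete, the sketch below mimics the layout of adata.obsm["airr"] with plain Python lists and dicts instead of an awkward array. It is an illustration only: the helper names chains_per_cell and field_per_cell are made up for this example and are not part of scirpy, and the truncated sequences are copied as-is from the example above.

```python
# Plain-Python stand-in for adata.obsm["airr"] (the real object is an awkward array).
airr = [
    # cell0: 2 chains
    [
        {"locus": "TRA", "junction_aa": "CADASGT..."},
        {"locus": "TRB", "junction_aa": "CTFDD..."},
    ],
    # cell1: 1 chain
    [{"locus": "IGH", "junction_aa": "CDGFFA..."}],
    # cell2: 0 chains
    [],
]


def chains_per_cell(airr):
    """Length of the second (variable-length) dimension for each cell."""
    return [len(chains) for chains in airr]


def field_per_cell(airr, field):
    """Extract one rearrangement field for every chain of every cell."""
    return [[chain[field] for chain in chains] for chains in airr]


n_chains = chains_per_cell(airr)
loci = field_per_cell(airr, "locus")
print(n_chains)
print(loci)
```

The outer list plays the role of the cell axis, each inner list holds that cell's chains, and each dict is one record with rearrangement fields.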
Chain indices
The scirpy receptor model allows up to two pairs of chains per cell. This representation requires separation of chains by locus into VJ and VDJ chains, and (optionally) filtering non-productive chains.
The index_chains() function serves this purpose. It creates an additional awkward array
in adata.obsm that has the following structure:
# adata.obsm["chain_indices"]
[
# cell0:
# * 1 VJ chain which is at index 0 in `adata.obsm["airr"][0]`
# * 1 VDJ chain which is at index 1 in `adata.obsm["airr"][0]`
# * multichain = False, because the cell does not have more than 2 VJ or VDJ chains
{"VJ": [0], "VDJ": [1], "multichain": False}, # single pair
# cell1:
# * primary VJ chain is at index 0 in `adata.obsm["airr"][1]`
# * secondary VJ chain is at index 2 in `adata.obsm["airr"][1]`
# * etc.
{"VJ": [0, 2], "VDJ": [1,3], "multichain": False}, # dual IR
]
The obsm["chain_indices"] array could easily be adapted to other receptor models. For instance,
a library working with spatial TCR data where each entry in obs corresponds to a “spot” with multiple cells rather
than a single cell could have a list with an arbitrary number of indices for the "VJ" and "VDJ" entries, respectively.
By using a different function for chain indexing, it would also be very straightforward to support non-IMGT loci (e.g.
from other species).
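As a rough illustration of what a chain-indexing function has to do, the following sketch builds records of the shape shown above from a nested list of chain dicts. It is not scirpy's implementation: index_chains_sketch and its productive_only parameter are hypothetical, chains keep their input order rather than being ranked, and chains without a "productive" field are assumed to be productive.

```python
# Loci grouped by chain type (VJ: no D segment, VDJ: with D segment).
VJ_LOCI = {"TRA", "TRG", "IGK", "IGL"}
VDJ_LOCI = {"TRB", "TRD", "IGH"}


def index_chains_sketch(airr, productive_only=True):
    """Build one chain-index record per cell (hypothetical helper)."""
    result = []
    for chains in airr:
        vj, vdj = [], []
        for i, chain in enumerate(chains):
            # Assumption: a missing "productive" field counts as productive.
            if productive_only and not chain.get("productive", True):
                continue
            if chain["locus"] in VJ_LOCI:
                vj.append(i)
            elif chain["locus"] in VDJ_LOCI:
                vdj.append(i)
        result.append(
            {
                # Keep at most two chains of each type (up to two pairs per cell).
                "VJ": vj[:2],
                "VDJ": vdj[:2],
                # Flag cells with more than two VJ or more than two VDJ chains.
                "multichain": len(vj) > 2 or len(vdj) > 2,
            }
        )
    return result


airr = [
    [{"locus": "TRA"}, {"locus": "TRB"}],
    [{"locus": "TRA"}, {"locus": "TRB"}, {"locus": "TRA"}, {"locus": "TRB"}],
    [{"locus": "TRA"}, {"locus": "TRA"}, {"locus": "TRA"},
     {"locus": "TRB", "productive": False}],
    [],
]
chain_indices = index_chains_sketch(airr)
for record in chain_indices:
    print(record)
```

Swapping the two locus sets, or lifting the limit of two chains per type, is all it would take to adapt this sketch to a different receptor model.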
Accessing AIRR data
Any scirpy function accessing AIRR data uses the indices in adata.obsm["chain_indices"] to subset the awkward array in
adata.obsm["airr"]. To retrieve AIRR data conveniently, we added the scirpy.get.airr() function. It lets you
specify one or multiple fields and chains and returns a pandas.Series or pandas.DataFrame, respectively:
# retrieve the "locus" field of the primary VJ chain for each cell
>>> ir.get.airr(adata, "locus", "VJ_1")
AAACCTGAGAGTGAGA-1 TRA
AAACCTGAGGCATTGG-1 TRA
AAACCTGCACCTCGTT-1 None
...
By using the airr_context() context manager, fields can be temporarily added to adata.obs
and used, e.g. for plotting:
with ir.get.airr_context(adata, "locus", "VJ_1"):
    sc.pl.umap(adata, color="VJ_1_locus")
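How the chain indices select entries from the AIRR array can be sketched in plain Python. The function get_airr_field below is a hypothetical stand-in written for this example; it works on nested lists rather than awkward arrays and returns a plain list instead of a pandas object.

```python
def get_airr_field(airr, chain_indices, field, chain):
    """Return one value per cell for e.g. chain="VJ_1" (primary VJ chain).

    `chain` is "<type>_<rank>" with type "VJ" or "VDJ" and a 1-based rank.
    Cells that lack the requested chain yield None.
    """
    chain_type, rank = chain.rsplit("_", 1)
    pos = int(rank) - 1
    values = []
    for chains, indices in zip(airr, chain_indices):
        idx_list = indices[chain_type]
        if pos < len(idx_list):
            # Look up the chain's position in the cell, then read the field.
            values.append(chains[idx_list[pos]][field])
        else:
            values.append(None)
    return values


airr = [
    [{"locus": "TRA"}, {"locus": "TRB"}],
    [{"locus": "IGH"}],
    [],
]
chain_indices = [
    {"VJ": [0], "VDJ": [1], "multichain": False},
    {"VJ": [], "VDJ": [0], "multichain": False},
    {"VJ": [], "VDJ": [], "multichain": False},
]
vj_1_locus = get_airr_field(airr, chain_indices, "locus", "VJ_1")
vdj_1_locus = get_airr_field(airr, chain_indices, "locus", "VDJ_1")
print(vj_1_locus)
print(vdj_1_locus)
```

The two-step lookup (index record first, chain record second) is the reason the raw AIRR array never needs to be reordered or filtered.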
Working with multimodal data
The recommended way of working with paired gene expression (GEX) and AIRR data is to use the
MuData container. MuData manages
multiple AnnData objects that share observations and/or features.
After reading in AIRR data with the scirpy IO module and gene expression data with scanpy,
they can be merged in a MuData object. For instance:
from mudata import MuData

adata_airr = ir.io.read_10x_vdj("all_contig_annotations.json")
adata_gex = sc.read_10x_h5("filtered_feature_bc_matrix.h5")
mdata = MuData({"airr": adata_airr, "gex": adata_gex})
Scirpy functions can be applied directly to the MuData object. By default, they retrieve AIRR data from the "airr"
modality.
ir.tl.chain_qc(mdata)
All functions that update obs in place update both mdata.obs[f"airr:{key_added}"] and mdata.mod["airr"].obs[key_added].
This means you usually do not need to call mdata.update() after running a scirpy function.
Should you prefer not to use MuData, this is entirely possible. All scirpy functions work equally well on a single
AnnData object that contains gene expression data in adata.X and AIRR data in adata.obsm["airr"].
Here is one way to merge the AIRR data into an AnnData object that already contains gene expression data:
# Map each cell barcode to its respective numeric index (assumes obs_names are unique)
barcode2idx = {barcode: i for i, barcode in enumerate(adata_airr.obs_names)}
# Generate a slice for the awkward array that retrieves the corresponding row from `adata_airr` for each
# barcode in `adata_gex`. `-1` will generate all "None"s for barcodes that are not in `adata_airr`
idx = [barcode2idx.get(barcode, -1) for barcode in adata_gex.obs_names]
adata_gex.obsm["airr"] = adata_airr.obsm["airr"][idx]
Common function parameters
Wherever applicable, scirpy’s functions take the following arguments:
airr_mod specifies the slot in MuData that contains the AnnData object with AIRR data. This parameter is ignored when working with AnnData directly. Defaults to "airr".
airr_key specifies the slot in AnnData.obsm that contains the awkward array with AIRR data. Defaults to "airr".
chain_idx_key specifies the slot in AnnData.obsm that contains the chain indices. Defaults to "chain_indices".
inplace defines whether a function stores its results back in the AnnData/MuData object or returns them.
key_added defines the key (e.g. in .obs) where a function’s result is stored if inplace=True.
The DataHandler class ensures that these parameters are handled consistently across functions.
For most use cases you can stick to the default and do not need to modify these parameters.
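The inplace and key_added convention can be illustrated with a toy function. Everything in this sketch is hypothetical: ToyData is a minimal stand-in for AnnData, count_chains is not a scirpy function, and the DataHandler class is not modelled here.

```python
class ToyData:
    """Minimal stand-in for AnnData with only `obs` and `obsm` mappings."""

    def __init__(self, obsm):
        self.obsm = obsm
        self.obs = {}


def count_chains(data, *, airr_key="airr", inplace=True, key_added="n_chains"):
    """Count chains per cell, following the inplace/key_added convention."""
    result = [len(chains) for chains in data.obsm[airr_key]]
    if inplace:
        # Store the result in the object and return nothing.
        data.obs[key_added] = result
        return None
    # Otherwise leave the object untouched and hand the result back.
    return result


data = ToyData(
    {"airr": [[{"locus": "TRA"}, {"locus": "TRB"}], [{"locus": "IGH"}], []]}
)
returned = count_chains(data, inplace=False)     # nothing is stored
count_chains(data, key_added="chain_count")      # stored under a custom key
print(returned)
print(data.obs)
```

Keeping these parameters keyword-only with defaults means most calls need none of them, which matches the advice to stick to the defaults.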