[1]:
%load_ext autoreload
%autoreload 2
import sys
sys.path.insert(0, "../..")
import scirpy as ir
import scanpy as sc
from glob import glob
import pandas as pd
import tarfile
import anndata
import warnings
from numba import NumbaPerformanceWarning
# ignore numba performance warnings
warnings.filterwarnings("ignore", category=NumbaPerformanceWarning)
# suppress "storing XXX as categorical" warnings.
anndata.logging.anndata_logger.setLevel("ERROR")
/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/setuptools_scm/git.py:68: UserWarning: "/home/runner/work/scirpy/scirpy" is shallow and may cause errors
warnings.warn('"{}" is shallow and may cause errors'.format(wd.path))
/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/traitlets/traitlets.py:3036: FutureWarning: --rc={'figure.dpi': 96} for dict-traits is deprecated in traitlets 5.0. You can pass --rc <key=value> ... multiple times to add items to a dict.
FutureWarning,
Loading adaptive Immune Receptor (IR)-sequencing data with Scirpy¶
In this notebook, we demonstrate how single-cell IR-data can be imported into
an AnnData
object for the use with Scirpy. To learn more about
AnnData and how Scirpy makes use of it, check out the Data structure section.
The example data used in this notebook are available from the Scirpy repository.
Important
The Scirpy data model
Currently, the Scirpy data model has the following constraints:
BCR and TCR chains are supported. Chain loci must be valid Chain locus, i.e. one of
TRA
,TRG
,IGK
, orIGL
(chains with a VJ junction) orTRB
,TRD
, orIGH
(chains with a VDJ junction). Other chains are discarded.Non-productive chains are removed. CellRanger, TraCeR, and the AIRR rearrangment format flag these cells appropriately. When reading custom formats, you need to pass the flag explicitly or filter the chains beforehand.
Each chain can contain up to two
VJ
and twoVDJ
chains (Dual IR). Excess chains are removed (those with lowest read count/UMI count) and cells flagged as Multichain-cell.
For more information, see Immune receptor (IR) model.
Note
:term:`IR` quality control
After importing the data, we recommend running the
scirpy.tl.chain_qc()
function. It willidentify the Receptor type and Receptor subtype and flag cells as
ambiguous
that cannot unambigously be assigned to a certain receptor (sub)type, andflag cells with orphan chains (i.e. cells with only a single detected cell) and multichain-cells (i.e. cells with more than two full pairs of VJ- and VDJ-chains).
We recommend excluding multichain- and ambiguous cells as these likely represent doublets
Based on the orphan chain flags, the corresponding cells can be excluded. Alternatively, these cells can be matched to clonotypes on a single chain only, by using the
receptor_arms="any"
parameter when runningscirpy.tl.define_clonotypes()
.
Loading data from 10x Genomics CellRanger, TraCeR, BraCer or AIRR-compliant tools¶
We provide convenience functions to load data from CellRanger, TraCeR, or BraCeR with a single function call, supporting both data generated on the 10x and Smart-seq2 sequencing platforms, respectively. Moreover, we support importing data in the community-standard AIRR rearrangement schema.
|
Read IR data from 10x Genomics cell-ranger output. |
|
Read data from TraCeR ([SLonnbergP+16]). |
|
|
|
Read AIRR-compliant data. |
Read 10x data¶
With read_10x_vdj()
we can load filtered_contig_annotations.csv
or contig_annotations.json
files as they are produced by CellRanger.
Here, we demonstrate how to load paired single cell transcriptomics and TCR sequencing data from COVID19 patients
from GSE145926 ([LLY+20]).
[2]:
# Load the TCR data
adata_tcr = ir.io.read_10x_vdj(
"example_data/liao-2019-covid19/GSM4385993_C144_filtered_contig_annotations.csv.gz"
)
# Load the associated transcriptomics data
adata = sc.read_10x_h5(
"example_data/liao-2019-covid19/GSM4339772_C144_filtered_feature_bc_matrix.h5"
)
/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead
if is_string_dtype(df[key]) and not is_categorical(df[key])
/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/anndata/_core/anndata.py:1094: FutureWarning: is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead
if not is_categorical(df_full[k]):
This particular sample only has a detected TCR for a small fraction of the cells:
[3]:
adata_tcr.shape
[3]:
(136, 0)
[4]:
adata.shape
[4]:
(3716, 33539)
Next, we integrate both the TCR and the transcriptomics data into a single anndata.AnnData
object
using scirpy.pp.merge_with_ir()
:
[5]:
ir.pp.merge_with_ir(adata, adata_tcr)
Now, we can use TCR-related variables together with the gene expression data. Here, we visualize the cells with a detected TCR on the UMAP plot. It is reassuring that the TCRs coincide with the T-cell marker gene CD3.
[6]:
sc.pp.log1p(adata)
sc.pp.pca(adata, svd_solver="arpack")
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color=["has_ir", "CD3E"])
/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/numba/np/ufunc/parallel.py:363: NumbaWarning: The TBB threading layer requires TBB version 2019.5 or later i.e., TBB_INTERFACE_VERSION >= 11005. Found TBB_INTERFACE_VERSION = 9107. The TBB threading layer is disabled.
warnings.warn(problem)
/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead
if is_string_dtype(df[key]) and not is_categorical(df[key])
Read Smart-seq2 data processed with TraCeR¶
TraCeR ([SLonnbergP+16]) is a method commonly used to extract TCR sequences from data generated with Smart-seq2 or other full-length single-cell sequencing protocols. Nf-core provides a full pipeline for processing Smart-seq2 sequencing data.
The scirpy.io.read_tracer()
function obtains its TCR information from the .pkl
file
in the filtered_TCR_seqs
folder TraCeR generates for each cell.
For this example, we load the ~500 cells from triple-negative breast cancer patients from GSE75688 ([CEL+17]). The raw data has been processed using the aforementioned Smart-seq2 pipeline from nf-core.
[7]:
# extract data
with tarfile.open("example_data/chung-park-2017.tar.bz2", "r:bz2") as tar:
tar.extractall("example_data/chung-park-2017")
[8]:
# Load transcriptomics data from count matrix
expr_chung = pd.read_csv("example_data/chung-park-2017/counts.tsv", sep="\t")
# anndata needs genes in columns and samples in rows
expr_chung = expr_chung.set_index("Geneid").T
adata = sc.AnnData(expr_chung)
adata.shape
[8]:
(563, 23438)
[9]:
# Load TCR data and merge it with transcriptomics data
adata_tcr = ir.io.read_tracer("example_data/chung-park-2017/tracer/")
ir.pp.merge_with_ir(adata, adata_tcr)
[10]:
sc.pp.highly_variable_genes(adata, flavor="cell_ranger", n_top_genes=3000)
sc.pp.log1p(adata)
sc.pp.pca(adata, svd_solver="arpack")
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color=["has_ir", "CD3E"])
/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/anndata/_core/anndata.py:1094: FutureWarning: is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead
if not is_categorical(df_full[k]):
Read an AIRR-compliant rearrangement table¶
We generated example data using immuneSIM ([WAY+20]). The data consists of 100 cells and does not include transcriptomics data.
The rearrangement tables are often organized into separate tables per chain. Therefore, scirpy.io.read_airr()
supports
specifiying multiple tsv
files at once. This would have the same effect as concatenating them before
the import.
[11]:
adata = ir.io.read_airr(
[
"example_data/immunesim_airr/immunesim_tra.tsv",
"example_data/immunesim_airr/immunesim_trb.tsv",
]
)
ir.tl.chain_qc(adata)
/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead
if is_string_dtype(df[key]) and not is_categorical(df[key])
The dataset does not come with transcriptomics data. We can, therefore, not show the UMAP plot highlighting cells with TCRs, but we can still use scirpy to analyse it. Below, we visualize the clonotype network connecting cells with similar CDR3 sequences.
Note: The cutoff of 25 was chosen for demonstration purposes on this small sample dataset. Usually a smaller cutoff is more approriate.
[12]:
ir.pp.ir_neighbors(
adata,
metric="alignment",
sequence="aa",
cutoff=25,
receptor_arms="any",
dual_ir="primary_only",
)
100%|██████████| 3/3 [00:00<00:00, 26.61it/s]
100%|██████████| 3/3 [00:00<00:00, 24.07it/s]
100%|██████████| 142/142 [00:00<00:00, 60484.53it/s]
100%|██████████| 147/147 [00:00<00:00, 122504.01it/s]
100%|██████████| 278/278 [00:00<00:00, 63120.04it/s]
[13]:
ir.tl.define_clonotype_clusters(adata, metric="alignment", sequence="aa")
ir.tl.clonotype_network(adata, layout="fr", metric="alignment", sequence="aa")
ir.pl.clonotype_network(adata, color="ct_cluster_aa_alignment", panel_size=(4, 4))
[13]:
array([<AxesSubplot:title={'center':'ct_cluster_aa_alignment'}, xlabel='clonotype_network1', ylabel='clonotype_network2'>],
dtype=object)
Creating AnnData objects from other formats¶
Often, immune receptor (IR) data are just provided as a simple table listing the CDR3 sequences for each cell.
We provide a generic data structure for cells with IRs, which can then be converted into
an AnnData
object.
|
Data structure for a Cell with immune receptors. |
|
Data structure for an immune cell receptor chain. |
|
If you believe you are working with a commonly used format, consider sending a feature request
for a read_XXX
function.
For this example, we again load the triple-negative breast cancer data from [CEL+17]. However, this time, we retrieve the TCR data from a separate summary table containing the TCR information (we generated this table for the sake of the example, but it could as well be a supplementary file from the paper).
Such a table typically contains information about
CDR3 sequences (amino acid and/or nucleotide)
The Chain locus, e.g.
TRA
,TRB
, orIGH
.expression of the receptor chain (e.g. count, UMI, transcripts per million (TPM))
the V(D)J genes for each chain
information if the chain is productive.
[14]:
tcr_table = pd.read_csv(
"example_data/chung-park-2017/tcr_table.tsv",
sep="\t",
index_col=0,
na_values=["None"],
true_values=["True"],
)
tcr_table
[14]:
cell_id | cdr3_alpha | cdr3_nt_alpha | count_alpha | v_alpha | j_alpha | cdr3_beta | cdr3_nt_beta | count_beta | v_beta | d_beta | j_beta | productive_alpha | productive_beta | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | SRR2973278 | AVSDIHASGGSYIPT | GCTGTTTCGGATATTCATGCATCAGGAGGAAGCTACATACCTACA | 9.29463 | TRAV21 | TRAJ6 | ASSWWQNTEAF | GCCAGCAGCTGGTGGCAGAACACTGAAGCTTTC | 37.5984 | TRBV5-1 | NaN | TRBJ1-1 | True | True |
1 | SRR2973305 | AVVTGANSKLT | GCTGTGGTAACTGGAGCCAATAGTAAGCTGACA | 89.45740 | TRAV22 | TRAJ56 | NaN | NaN | NaN | NaN | NaN | NaN | True | True |
2 | SRR2973371 | ALKRTGNTPLV | GCTCTGAAAAGAACAGGAAACACACCTCTTGTC | 431.96500 | TRAV9-2 | TRAJ29 | ASRSRDSGEPQH | GCCAGCAGGAGCAGGGACAGCGGAGAGCCCCAGCAT | 952.0230 | TRBV10-2 | TRBD1 | TRBJ1-5 | True | True |
3 | SRR2973377 | ATDPETSGSRLT | GCTACGGACCCAGAAACCAGTGGCTCTAGGTTGACC | 772.43600 | TRAV17 | TRAJ58 | NaN | NaN | NaN | NaN | NaN | NaN | True | True |
4 | SRR2973403 | AVRGATDSWGKFQ | GCTGTGAGAGGAGCAACTGACAGCTGGGGGAAATTCCAG | 95.63640 | TRAV3 | TRAJ24 | SVQTSEYEQY | AGCGTCCAGACTAGCGAGTACGAGCAGTAC | 205.8330 | TRBV29-1 | TRBD2 | TRBJ2-7 | True | True |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
85 | SRR5023618 | NaN | NaN | NaN | NaN | NaN | ASSDSPFSSYNEQF | GCCAGCAGTGACTCGCCCTTTAGCTCCTACAATGAGCAGTTC | 864.4550 | TRBV6-4 | NaN | TRBJ2-1 | True | True |
86 | SRR5023621 | AENSGGSNYKLT | GCAGAGAATAGTGGAGGTAGCAACTATAAACTGACA | 512.63000 | TRAV13-2 | TRAJ53 | ASSPDGGGGYT | GCCAGCAGCCCTGATGGGGGAGGGGGCTACACC | 805.2010 | TRBV7-3 | TRBD2 | TRBJ1-2 | True | True |
87 | SRR5023626 | ALRIGSNYKLT | GCTCTGAGAATCGGTAGCAACTATAAACTGACA | 12.51630 | TRAV9-2 | TRAJ53 | NaN | NaN | NaN | NaN | NaN | NaN | True | True |
88 | SRR5023633 | NaN | NaN | NaN | NaN | NaN | ASGLGQSVGGTQY | GCTAGTGGCCTAGGGCAGTCGGTAGGAGGGACCCAGTAC | 18.4273 | TRBV12-5 | TRBD2 | TRBJ2-5 | True | True |
89 | SRR5023639 | NaN | NaN | NaN | NaN | NaN | ASSKGSLGPAGELF | GCCAGCAGCAAAGGATCGCTGGGGCCCGCCGGGGAGCTGTTT | 905.9260 | TRBV21-1 | TRBD1 | TRBJ2-2 | True | True |
90 rows × 14 columns
Our task is now to dissect the table into IrCell
and IrChain
objects.
Each IrCell
can have an arbitrary number of chains.
When converting the IrCell
objects into an AnnData
object,
scirpy will only retain at most two alpha and two beta chains per cell and flag cells which exceed
this number as multichain cells. For more information, check the page about our Immune receptor (IR) model.
[15]:
tcr_cells = []
for idx, row in tcr_table.iterrows():
cell = ir.io.IrCell(cell_id=row["cell_id"])
alpha_chain = ir.io.IrChain(
locus="TRA",
cdr3=row["cdr3_alpha"],
cdr3_nt=row["cdr3_nt_alpha"],
expr=row["count_alpha"],
v_gene=row["v_alpha"],
j_gene=row["j_alpha"],
is_productive=row["productive_alpha"],
)
beta_chain = ir.io.IrChain(
locus="TRB",
cdr3=row["cdr3_beta"],
cdr3_nt=row["cdr3_nt_beta"],
expr=row["count_beta"],
v_gene=row["v_beta"],
d_gene=row["d_beta"],
j_gene=row["j_beta"],
is_productive=row["productive_beta"],
)
cell.add_chain(alpha_chain)
cell.add_chain(beta_chain)
tcr_cells.append(cell)
Now, we can convert the list of IrCell
objects using scirpy.io.from_ir_objs()
.
[16]:
adata_tcr = ir.io.from_ir_objs(tcr_cells)
/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead
if is_string_dtype(df[key]) and not is_categorical(df[key])
[17]:
# We can re-use the transcriptomics data from above...
adata = sc.AnnData(expr_chung)
# ... and merge it with the TCR data
ir.pp.merge_with_ir(adata, adata_tcr)
[18]:
sc.pp.highly_variable_genes(adata, flavor="cell_ranger", n_top_genes=3000)
sc.pp.log1p(adata)
sc.pp.pca(adata, svd_solver="arpack")
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color=["has_ir", "CD3E"])
/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/anndata/_core/anndata.py:1094: FutureWarning: is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead
if not is_categorical(df_full[k]):
Combining multiple samples¶
It is quite common that the sequncing data is split up in multiple samples.
To combine them into a single object, we load each sample independently using one of the approaches described
in this document. Then, we combine them using anndata.AnnData.concatenate()
.
Here is a full example loading and combining three samples from the COVID19 study by [LLY+20].
[19]:
# define sample metadata. Usually read from a file.
samples = {
"C144": {"group": "mild"},
"C146": {"group": "severe"},
"C149": {"group": "healthy control"},
}
[20]:
# Create a list of AnnData objects (one for each sample)
adatas = []
for sample, sample_meta in samples.items():
gex_file = glob(f"example_data/liao-2019-covid19/*{sample}*.h5")[0]
tcr_file = glob(f"example_data/liao-2019-covid19/*{sample}*.csv.gz")[0]
adata = sc.read_10x_h5(gex_file)
adata_tcr = ir.io.read_10x_vdj(tcr_file)
ir.pp.merge_with_ir(adata, adata_tcr)
adata.obs["sample"] = sample
adata.obs["group"] = sample_meta["group"]
# concatenation only works with unique gene names
adata.var_names_make_unique()
adatas.append(adata)
/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/anndata/_core/anndata.py:1094: FutureWarning: is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead
if not is_categorical(df_full[k]):
/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead
if is_string_dtype(df[key]) and not is_categorical(df[key])
[21]:
# Merge anndata objects
adata = adatas[0].concatenate(adatas[1:])
The data is now integrated in a single object. Again, the detected TCRs coincide with CD3E
gene expression. We clearly observe batch effects between the samples – for a meaningful downstream analysis further processing steps such as highly-variable gene filtering and batch correction are necessary.
[22]:
sc.pp.log1p(adata)
sc.pp.pca(adata, svd_solver="arpack")
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color=["has_ir", "CD3E", "sample"])
/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead
if is_string_dtype(df[key]) and not is_categorical(df[key])