Analysis of 3k T cells from cancer¶

In this tutorial, we re-analyze single-cell TCR/RNA-seq data from Wu et al. ([WMdA+20]) generated on the 10x Genomics platform. The original dataset consists of >140k T cells from 14 treatment-naive patients across four different types of cancer. For this tutorial, to speed up computations, we use a downsampled version of 3k cells.

[1]:

# This cell is for development only. Don't copy this to your notebook.
%load_ext autoreload
%autoreload 2

[2]:

import numpy as np
import pandas as pd
import scanpy as sc
import scirpy as ir
from matplotlib import pyplot as plt, cm as mpl_cm
from cycler import cycler

sc.set_figure_params(figsize=(4, 4))
sc.settings.verbosity = 2  # verbosity: errors (0), warnings (1), info (2), hints (3)

/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/traitlets/traitlets.py:3044: FutureWarning: --rc={'figure.dpi': 96} for dict-traits is deprecated in traitlets 5.0. You can pass --rc <key=value> ... multiple times to add items to a dict.
  warn(

[3]:

sc.logging.print_header()

scanpy==1.8.1 anndata==0.7.6 umap==0.5.1 numpy==1.20.3 scipy==1.7.1 pandas==1.3.3 scikit-learn==0.24.2 statsmodels==0.13.0rc0 python-igraph==0.9.6 pynndescent==0.5.4

The dataset ships with the scirpy package. We can conveniently load it from the datasets module:

[4]:

adata = ir.datasets.wu2020_3k()

try downloading from url
https://github.com/icbi-lab/scirpy/releases/download/d0.1.0/wu2020_3k.h5ad
... this may take a while but only happens once
creating directory /home/runner/work/scirpy/scirpy/docs/tutorials/data/ for saving data

100%|██████████| 16.0M/16.0M [00:00<00:00, 124MB/s]

Note

adata is a regular AnnData object with additional, Immune Receptor (IR)-specific columns in obs that are named according to the AIRR Rearrangement Schema. For more information, check the page about

Scirpy’s data structure, and

Scirpy’s working model of immune receptors.

Warning

scirpy’s data structure has updated in v0.7.0 to be fully AIRR-compliant.

AnnData objects created with older versions of scirpy can be upgraded with scirpy.io.upgrade_schema() to be compatible with the latest version of scirpy.

[5]:

adata.shape

[5]:

(3000, 30727)

[6]:

adata.obs

[6]:

	cluster_orig	patient	sample	source	clonotype_orig	multi_chain	IR_VJ_1_locus	IR_VJ_2_locus	IR_VDJ_1_locus	IR_VDJ_2_locus	...	IR_VJ_2_c_call	IR_VDJ_1_c_call	IR_VDJ_2_c_call	IR_VJ_1_junction	IR_VJ_2_junction	IR_VDJ_1_junction	IR_VDJ_2_junction	has_ir	batch	extra_chains
LN1_GTAGGCCAGCGTAGTG-1-19	4.4-FOS	Lung1	LN1	NAT	lung1.tn.C223	False	nan	nan	TRB	TRB	...	nan	TRBC2	TRBC2	None	None	TGTGCCAGCAGCTTAATGCGGCTAGCGGGAGATACGCAGTATTTT	TGTGCAAGTCGCTTAGCGGTTTTATCGACTAGCGGGAGTGTCGGAG...	True	19	NaN
RN2_AGAGCGACAGATTGCT-1-27	4.4-FOS	Renal2	RN2	NAT	renal2.tnb.C1362	False	TRA	nan	TRB	nan	...	nan	TRBC1	nan	TGTGCTGTGAGGGGGAATAACAATGCCAGACTCATGTTT	None	TGTGCCAGCAGCTTTGGAACGGTGGCTGAAGCTTTCTTT	None	True	27	NaN
LN1_GTCATTTCAATGAAAC-1-19	8.2-Tem	Lung1	LN1	NAT	lung1.tn.C25	False	TRA	nan	TRB	nan	...	nan	TRBC1	nan	TGTGCTGTGAGGTTGGGTAACCAGTTCTATTTT	None	TGCAGTGCTAGAGATGGAGGGGGGGGGAACACTGAAGCTTTCTTT	None	True	19	NaN
LN2_GACACGCAGGTAGCTG-2-2	8.6-KLRB1	Lung2	LN2	NAT	lung2.tn.C2452	False	nan	nan	TRB	nan	...	nan	TRBC2	nan	None	None	TGTGCCAGCAGCCAAGGTCAGGGACAGGATTTTAACTACGAGCAGT...	None	True	2	NaN
LN2_GCACTCTCAGGGATTG-2-2	4.4-FOS	Lung2	LN2	NAT	lung2.tn.C5631	False	TRA	nan	TRB	nan	...	nan	TRBC1	nan	TGTGCAGCAAGCGACCCCACGGTCGAGGCAGGAACTGCTCTGATCTTT	None	TGTGCCAGCAGCTTGACCGTTAACACTGAAGCTTTCTTT	None	True	2	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
RT3_GCAGTTAGTATGAAAC-1-6	4.2-RPL32	Renal3	RT3	Tumor	renal3.tnb.C176	False	TRA	nan	TRB	nan	...	nan	TRBC2	nan	TGTGCTGCGATGGATAGCAACTATCAGTTAATCTGG	None	TGTGCCACCAAGGACAGGGAAGACACCGGGGAGCTGTTTTTT	None	True	6	NaN
LT1_GACGTGCTCTCAAGTG-1-24	8.2-Tem	Lung1	LT1	Tumor	lung1.tn.C151	False	TRA	nan	nan	nan	...	nan	nan	nan	TGTGCTTATAGGAGTTCCCTTGGTGGTGCTACAAACAAGCTCATCTTT	None	None	None	True	24	NaN
ET3_GCTGGGTAGACCTTTG-1-3	3.1-MT	Endo3	ET3	Tumor	endo3.tn.C76	False	nan	nan	TRB	nan	...	nan	TRBC2	nan	None	None	TGTGCCAGCAGCCGGACAGGGGGGGATTCCGGGGAGCTGTTTTTT	None	True	3	NaN
RT1_TAAGAGATCCTTAATC-1-8	4.5-IL6ST	Renal1	RT1	Tumor	renal1.tnb.C83	False	TRA	nan	TRB	nan	...	nan	TRBC2	nan	TGTGCAATGAGCGAGATTTCTGGTGGCTACAATAAGCTGATTTTT	None	TGTGCCTGGAGTGACAGGTCAGACGAGCAGTACTTC	None	True	8	NaN
LN2_TCTGAGAAGGACACCA-2-2	4.6b-Treg	Lung2	LN2	NAT	lung2.tn.C6211	False	TRA	nan	TRB	nan	...	nan	TRBC2	nan	TGTGCTACGGAGGAATATGGAAACAAGCTGGTCTTT	None	TGTGCCAGCAGCTTTCTACCTTTGGGAGATACGCAGTATTTT	None	True	2	NaN

3000 rows × 45 columns

Note

Importing data

scirpy natively supports reading IR data from Cellranger (10x), TraCeR (Smart-seq2) or the AIRR rearrangement schema and provides helper functions to import other data types. We provide a dedicated tutorial on data loading with more details.

This particular dataset has been imported using scirpy.io.read_10x_vdj() and merged with transcriptomics data using scirpy.pp.merge_with_ir(). The exact procedure is described in scirpy.datasets.wu2020().

Preprocess Transcriptomics data¶

Transcriptomics data needs to be filtered and preprocessed as with any other single-cell dataset. We recommend following the scanpy tutorial and the best practice paper by Luecken et al.

[7]:

sc.pp.filter_genes(adata, min_cells=10)
sc.pp.filter_cells(adata, min_genes=100)

filtered out 18877 genes that are detected in less than 10 cells

[8]:

sc.pp.normalize_per_cell(adata, counts_per_cell_after=1000)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, flavor="cell_ranger", n_top_genes=5000)
sc.tl.pca(adata)
sc.pp.neighbors(adata)

normalizing by total count per cell
    finished (0:00:00): normalized adata.X and added    'n_counts', counts per cell before normalization (adata.obs)
If you pass `n_top_genes`, all cutoffs are ignored.
extracting highly variable genes
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=50
    finished (0:00:00)
computing neighbors
    using 'X_pca' with n_pcs = 50
    finished (0:00:02)

For the Wu2020 dataset, the authors already provide clusters and UMAP coordinates. Instead of performing clustering and cluster annotation ourselves, we will just use the provided data. The clustering and annotation methodology is described in their paper.

[9]:

adata.obsm["X_umap"] = adata.obsm["X_umap_orig"]

[10]:

mapping = {
    "nan": "other",
    "3.1-MT": "other",
    "4.1-Trm": "CD4_Trm",
    "4.2-RPL32": "CD4_RPL32",
    "4.3-TCF7": "CD4_TCF7",
    "4.4-FOS": "CD4_FOSS",
    "4.5-IL6ST": "CD4_IL6ST",
    "4.6a-Treg": "CD4_Treg",
    "4.6b-Treg": "CD4_Treg",
    "8.1-Teff": "CD8_Teff",
    "8.2-Tem": "CD8_Tem",
    "8.3a-Trm": "CD8_Trm",
    "8.3b-Trm": "CD8_Trm",
    "8.3c-Trm": "CD8_Trm",
    "8.4-Chrom": "other",
    "8.5-Mitosis": "other",
    "8.6-KLRB1": "other",
}
adata.obs["cluster"] = [mapping[x] for x in adata.obs["cluster_orig"]]

Let’s inspect the UMAP plots. The first three panels show the UMAP plot colored by sample, patient and cluster. We don’t observe any clustering of samples or patients that could hint at batch effects.

The last three panels show the UMAP colored by the T cell markers CD8, CD4, and FOXP3. We can confirm that the markers correspond to their respective cluster labels.

[11]:

sc.pl.umap(
    adata,
    color=["sample", "patient", "cluster", "CD8A", "CD4", "FOXP3"],
    ncols=2,
    wspace=0.7,
)

/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Removing unused categories will always return a new Categorical object.
  c.reorder_categories(natsorted(c.categories), inplace=True)
... storing 'cluster' as categorical

../_images/tutorials_tutorial_3k_tcr_18_1.png

TCR Quality Control¶

While most of T cell receptors have exactly one pair of α and β chains, up to one third of T cells can have dual TCRs, i.e. two pairs of receptors originating from different alleles ([SB19]).

Using the scirpy.tl.chain_qc() function, we can add a summary about the Immune cell-receptor compositions to adata.obs. We can visualize it using scirpy.pl.group_abundance().

Note

chain pairing

Orphan chain refers to cells that have either a single alpha or beta receptor chain.
Extra chain refers to cells that have a full alpha/beta receptor pair, and an additional chain.
Multichain refers to cells with more than two receptor pairs detected. These cells are likely doublets.

Note

receptor type and receptor subtype

receptor_type refers to a coarse classification into BCR and TCR. Cells with both BCR and TCR chains are labelled as ambiguous.
receptor_subtype refers to a more specific classification into α/β, ɣ/δ, IG-λ, and IG-κ chain configurations.

For more details, see scirpy.tl.chain_qc().

[12]:

ir.tl.chain_qc(adata)

As expected, the dataset contains only α/β T-cell receptors:

[13]:

ax = ir.pl.group_abundance(adata, groupby="receptor_subtype", target_col="source")

/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Removing unused categories will always return a new Categorical object.
  c.reorder_categories(natsorted(c.categories), inplace=True)
... storing 'receptor_type' as categorical
/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Removing unused categories will always return a new Categorical object.
  c.reorder_categories(natsorted(c.categories), inplace=True)
... storing 'receptor_subtype' as categorical
/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Removing unused categories will always return a new Categorical object.
  c.reorder_categories(natsorted(c.categories), inplace=True)
... storing 'chain_pairing' as categorical

../_images/tutorials_tutorial_3k_tcr_23_1.png

[14]:

ax = ir.pl.group_abundance(adata, groupby="chain_pairing", target_col="source")

../_images/tutorials_tutorial_3k_tcr_24_0.png

Indeed, in this dataset, ~6% of cells have more than one pair of productive T-cell receptors:

[15]:

print(
    "Fraction of cells with more than one pair of TCRs: {:.2f}".format(
        np.sum(
            adata.obs["chain_pairing"].isin(
                ["extra VJ", "extra VDJ", "two full chains"]
            )
        )
        / adata.n_obs
    )
)

Fraction of cells with more than one pair of TCRs: 0.06

Next, we visualize the Multichain-cells on the UMAP plot and exclude them from downstream analysis:

[16]:

sc.pl.umap(adata, color="chain_pairing", groups="multichain")

../_images/tutorials_tutorial_3k_tcr_28_0.png

[17]:

adata = adata[adata.obs["chain_pairing"] != "multichain", :].copy()

Similarly, we can use the chain_pairing information to exclude all cells that don’t have at least one full pair of receptor sequences:

[18]:

adata = adata[~adata.obs["chain_pairing"].isin(["orphan VDJ", "orphan VJ"]), :].copy()

Finally, we re-create the chain-pairing plot to ensure that the filtering worked as expected:

[19]:

ax = ir.pl.group_abundance(adata, groupby="chain_pairing", target_col="source")

../_images/tutorials_tutorial_3k_tcr_33_0.png

Define clonotypes and clonotype clusters¶

Warning

The way scirpy computes clonotypes and clonotype clusters has been changed in v0.7.0.

Code written based on scirpy versions prior to v0.7.0 needs to be updated. Please read the release notes and follow the description in this section.

In this section, we will define and visualize clonotypes and clonotype clusters.

Scirpy implements a network-based approach for clonotype definition. The steps to create and visualize the clonotype-network are analogous to the construction of a neighborhood graph from transcriptomics data with scanpy.

Analysis steps on IR data¶
scirpy function	objective
`scirpy.pp.ir_dist()`	Compute sequence-based distance matrices for all VJ and VDJ sequences.
`scirpy.tl.define_clonotypes()`	Define clonotypes by nucleotide sequence identity.
`scirpy.tl.define_clonotype_clusters()`	Cluster cells by the similarity of their CDR3-sequences.
`scirpy.tl.clonotype_network()`	Compute layout of the clonotype network.
`scirpy.pl.clonotype_network()`	Plot clonotype network colored by different parameters.

Compute CDR3 neighborhood graph and define clonotypes¶

scirpy.pp.ir_dist() computes distances between CDR3 nucleotide (nt) or amino acid (aa) sequences, either based on sequence identity or similarity. It creates two distance matrices: one for all unique VJ sequences and one for all unique VDJ sequences. The distance matrices are added to adata.uns.

The function scirpy.tl.define_clonotypes() matches cells based on the distances of their VJ and VDJ CDR3-sequences and value of the function parameters dual_ir and receptor_arms. Finally, it detects connected modules in the graph and annotates them as clonotypes. This will add a clone_id and clone_id_size column to adata.obs.

The dual_ir parameter defines how scirpy handles cells with more than one pair of receptors. The default value is any which implies that cells with any of their primary or secondary receptor chain matching will be considered to be of the same clonotype.

Here, we define clonotypes based on nt-sequence identity. In a later step, we will define clonotype clusters based on amino-acid similarity.

[20]:

# using default parameters, `ir_dist` will compute nucleotide sequence identity
ir.pp.ir_dist(adata)
ir.tl.define_clonotypes(adata, receptor_arms="all", dual_ir="primary_only")

Computing sequence x sequence distance matrix for VJ sequences.
Computing sequence x sequence distance matrix for VDJ sequences.
Initializing lookup tables.
Computing clonotype x clonotype distances.

100%|██████████| 1526/1526 [00:02<00:00, 590.01it/s]

Stored clonal assignments in `adata.obs["clone_id"]`.

To visualize the network we first call scirpy.tl.clonotype_network() to compute the layout. We can then visualize it using scirpy.pl.clonotype_network(). We recommend setting the min_cells parameter to >=2, to prevent the singleton clonotypes from cluttering the network.

[21]:

ir.tl.clonotype_network(adata, min_cells=2)

The resulting plot is a network, where each dot represents cells with identical receptor configurations. As we define clonotypes as cells with identical CDR3-sequences, each dot is also a clonotype. For each clonotype, the numeric clonotype id is shown in the graph. The size of each dot refers to the number of cells with the same receptor configurations. Categorical variables can be visualized as pie charts.

[22]:

ir.pl.clonotype_network(
    adata, color="source", base_size=20, label_fontsize=9, panel_size=(7, 7)
)

/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Removing unused categories will always return a new Categorical object.
  c.reorder_categories(natsorted(c.categories), inplace=True)
... storing 'clone_id' as categorical

[22]:

<AxesSubplot:>

../_images/tutorials_tutorial_3k_tcr_42_2.png

Re-compute CDR3 neighborhood graph and define clonotype clusters¶

We can now re-compute the clonotype network based on amino-acid sequence similarity and define clonotype clusters.

To this end, we need to change set metric=”alignment” and specify a cutoff parameter. The distance is based on the BLOSUM62 matrix. For instance, a distance of 10 is equivalent to 2 R`s mutating into `N. This appoach was initially proposed as TCRdist by Dash et al. ([DFGH+17]).

All cells with a distance between their CDR3 sequences lower than cutoff will be connected in the network.

[23]:

ir.pp.ir_dist(
    adata,
    metric="alignment",
    sequence="aa",
    cutoff=15,
)

Computing sequence x sequence distance matrix for VJ sequences.

100%|██████████| 496/496 [00:13<00:00, 35.60it/s]

Computing sequence x sequence distance matrix for VDJ sequences.


100%|██████████| 496/496 [00:13<00:00, 35.58it/s]

[24]:

ir.tl.define_clonotype_clusters(
    adata, sequence="aa", metric="alignment", receptor_arms="all", dual_ir="any"
)

Initializing lookup tables.
Computing clonotype x clonotype distances.

100%|██████████| 1549/1549 [00:07<00:00, 197.82it/s]

Stored clonal assignments in `adata.obs["cc_aa_alignment"]`.

[25]:

ir.tl.clonotype_network(adata, min_cells=3, sequence="aa", metric="alignment")

Compared to the previous plot, we observere several connected dots. Each fully connected subnetwork represents a “clonotype cluster”, each dot still represents cells with identical receptor configurations.

The dots are colored by patient. We observe, that for instance, clonotypes 101 and 68 (left top and bottom) are private, i.e. they contain cells from a single patient only. On the other hand, clonotype 159 (left middle) is public, i.e. it is shared across patients Lung1 and Lung3.

[26]:

ir.pl.clonotype_network(
    adata, color="patient", label_fontsize=9, panel_size=(7, 7), base_size=20
)

/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Removing unused categories will always return a new Categorical object.
  c.reorder_categories(natsorted(c.categories), inplace=True)
... storing 'cc_aa_alignment' as categorical

[26]:

<AxesSubplot:>

../_images/tutorials_tutorial_3k_tcr_49_2.png

We can now extract information (e.g. CDR3-sequences) from a specific clonotype cluster by subsetting AnnData. When extracting the CDR3 sequences of clonotype cluster 159, we retreive five different receptor configurations with different numbers of cells, corresponding to the five points in the graph.

[27]:

adata.obs.loc[adata.obs["cc_aa_alignment"] == "159", :].groupby(
    [
        "IR_VJ_1_junction_aa",
        "IR_VJ_2_junction_aa",
        "IR_VDJ_1_junction_aa",
        "IR_VDJ_2_junction_aa",
        "receptor_subtype",
    ],
    observed=True,
).size().reset_index(name="n_cells")

[27]:

	IR_VJ_1_junction_aa	IR_VJ_2_junction_aa	IR_VDJ_1_junction_aa	IR_VDJ_2_junction_aa	receptor_subtype	n_cells
0	CAGKSGNTGKLIF	None	CASSYQGATEAFF	None	TRA+TRB	12
1	CAGKSGNTGKLIF	CATDPRRSTGNQFYF	CASSYQGATEAFF	None	TRA+TRB	1
2	CATDPRRSTGNQFYF	None	CASSYQGATEAFF	None	TRA+TRB	11
3	CATDPRRSTGNQFYF	CAGKSGNTGKLIF	CASSYQGATEAFF	None	TRA+TRB	1
4	CAGKAGNTGKLIF	None	CASSYQGSTEAFF	None	TRA+TRB	1

Including the V-gene in clonotype definition¶

Using the paramter use_v_gene in define_clonotypes(), we can enforce clonotypes (or clonotype clusters) to have the same V-gene, and, therefore, the same CDR1 and 2 regions. Let’s look for clonotype clusters with different V genes:

[28]:

ir.tl.define_clonotype_clusters(
    adata,
    sequence="aa",
    metric="alignment",
    receptor_arms="all",
    dual_ir="any",
    same_v_gene=True,
    key_added="cc_aa_alignment_same_v",
)

Initializing lookup tables.
Computing clonotype x clonotype distances.

100%|██████████| 1549/1549 [00:08<00:00, 180.84it/s]

Stored clonal assignments in `adata.obs["cc_aa_alignment_same_v"]`.

[29]:

# find clonotypes with more than one `clonotype_same_v`
ct_different_v = adata.obs.groupby("cc_aa_alignment").apply(
    lambda x: x["cc_aa_alignment_same_v"].nunique() > 1
)
ct_different_v = ct_different_v[ct_different_v].index.values.tolist()
ct_different_v

[29]:

['280', '765']

Here, we see that the clonotype clusters 280 and 765 get split into (280, 788) and (765, 1071), respectively, when the same_v_gene flag is set.

[30]:

adata.obs.loc[
    adata.obs["cc_aa_alignment"].isin(ct_different_v),
    [
        "cc_aa_alignment",
        "cc_aa_alignment_same_v",
        "IR_VJ_1_v_call",
        "IR_VDJ_1_v_call",
    ],
].sort_values("cc_aa_alignment").drop_duplicates().reset_index(drop=True)

[30]:

	cc_aa_alignment	cc_aa_alignment_same_v	IR_VJ_1_v_call	IR_VDJ_1_v_call
0	280	280	TRAV8-6	TRBV6-6
1	280	788	TRAV8-3	TRBV9
2	765	765	TRAV21	TRBV6-6
3	765	1071	TRAV21	TRBV6-5

Clonotype analysis¶

Clonal expansion¶

Let’s visualize the number of expanded clonotypes (i.e. clonotypes consisting of more than one cell) by cell-type. The first option is to add a column with the scirpy.tl.clonal_expansion() to adata.obs and overlay it on the UMAP plot.

[31]:

ir.tl.clonal_expansion(adata)

clonal_expansion refers to expansion categories, i.e singleton clonotypes, clonotypes with 2 cells and more than 2 cells. The clonotype_size refers to the absolute number of cells in a clonotype.

[32]:

sc.pl.umap(adata, color=["clonal_expansion", "clone_id_size"])

/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Removing unused categories will always return a new Categorical object.
  c.reorder_categories(natsorted(c.categories), inplace=True)
... storing 'cc_aa_alignment_same_v' as categorical
/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Removing unused categories will always return a new Categorical object.
  c.reorder_categories(natsorted(c.categories), inplace=True)
... storing 'clonal_expansion' as categorical

../_images/tutorials_tutorial_3k_tcr_62_1.png

The second option is to show the number of cells belonging to an expanded clonotype per category in a stacked bar plot, using the scirpy.pl.clonal_expansion() plotting function.

[33]:

ir.pl.clonal_expansion(adata, groupby="cluster", clip_at=4, normalize=False)

[33]:

<AxesSubplot:>

../_images/tutorials_tutorial_3k_tcr_64_1.png

The same plot, normalized to cluster size. Clonal expansion is a sign of positive selection for a certain, reactive T-cell clone. It, therefore, makes sense that CD8+ effector T-cells have the largest fraction of expanded clonotypes.

[34]:

ir.pl.clonal_expansion(adata, "cluster")

[34]:

<AxesSubplot:>

../_images/tutorials_tutorial_3k_tcr_66_1.png

Expectedly, the CD8+ effector T cells have the largest fraction of expanded clonotypes.

Consistent with this observation, they have the lowest scirpy.pl.alpha_diversity() of clonotypes.

[35]:

ax = ir.pl.alpha_diversity(
    adata, metric="normalized_shannon_entropy", groupby="cluster"
)

../_images/tutorials_tutorial_3k_tcr_68_0.png

Clonotype abundance¶

The function scirpy.pl.group_abundance() allows us to create bar charts for arbitrary categorial from obs. Here, we use it to show the distribution of the ten largest clonotypes across the cell-type clusters.

[36]:

ir.pl.group_abundance(adata, groupby="clone_id", target_col="cluster", max_cols=10)

[36]:

<AxesSubplot:title={'center':'Number of cells in clone_id by cluster'}, xlabel='clone_id', ylabel='Number of cells'>

../_images/tutorials_tutorial_3k_tcr_71_1.png

It might be beneficial to normalize the counts to the number of cells per sample to mitigate biases due to different sample sizes:

[37]:

ir.pl.group_abundance(
    adata, groupby="clone_id", target_col="cluster", max_cols=10, normalize="sample"
)

[37]:

<AxesSubplot:title={'center':'Fraction of cluster in each clone_id'}, xlabel='clone_id', ylabel='Fraction of cells in sample'>

../_images/tutorials_tutorial_3k_tcr_73_1.png

Coloring the bars by patient gives us information about public and private clonotypes: Some clonotypes are private, i.e. specific to a certain tissue, others are public, i.e. they are shared across different tissues.

[38]:

ax = ir.pl.group_abundance(
    adata, groupby="clone_id", target_col="source", max_cols=15, figsize=(5, 3)
)

../_images/tutorials_tutorial_3k_tcr_75_0.png

However, clonotypes that are shared between patients are rare:

[39]:

ax = ir.pl.group_abundance(
    adata, groupby="clone_id", target_col="patient", max_cols=15, figsize=(5, 3)
)

../_images/tutorials_tutorial_3k_tcr_77_0.png

Gene usage¶

scirpy.tl.group_abundance() can also give us some information on VDJ usage. We can choose any of the {TRA,TRB}_{1,2}_{v,d,j,c}_gene columns to make a stacked bar plot. We use max_col to limit the plot to the 10 most abundant V-genes.

[40]:

ax = ir.pl.group_abundance(
    adata, groupby="IR_VJ_1_v_call", target_col="cluster", normalize=True, max_cols=10
)

../_images/tutorials_tutorial_3k_tcr_80_0.png

We can pre-select groups by filtering adata:

[41]:

ax = ir.pl.group_abundance(
    adata[
        adata.obs["IR_VDJ_1_v_call"].isin(
            ["TRBV20-1", "TRBV7-2", "TRBV28", "TRBV5-1", "TRBV7-9"]
        ),
        :,
    ],
    groupby="cluster",
    target_col="IR_VDJ_1_v_call",
    normalize=True,
)

Trying to set attribute `.uns` of view, copying.

../_images/tutorials_tutorial_3k_tcr_82_1.png

The exact combinations of VDJ genes can be visualized as a Sankey-plot using scirpy.pl.vdj_usage().

[42]:

ax = ir.pl.vdj_usage(adata, full_combination=False, max_segments=None, max_ribbons=30)

../_images/tutorials_tutorial_3k_tcr_84_0.png

We can also use this plot to investigate the exact VDJ composition of one (or several) clonotypes:

[43]:

ir.pl.vdj_usage(
    adata[adata.obs["clone_id"].isin(["68", "101", "127", "161"]), :],
    max_ribbons=None,
    max_segments=100,
)

[43]:

<AxesSubplot:>

../_images/tutorials_tutorial_3k_tcr_86_1.png

Spectratype plots¶

spectratype() plots give us information about the length distribution of CDR3 regions.

[44]:

ir.pl.spectratype(adata, color="cluster", viztype="bar", fig_kws={"dpi": 120})

[44]:

<AxesSubplot:title={'center':'Spectratype of IR_VJ_1_junction_aa by cluster'}, xlabel='IR_VJ_1_junction_aa length', ylabel='Number of cells'>

../_images/tutorials_tutorial_3k_tcr_89_1.png

The same chart visualized as “ridge”-plot:

[45]:

ir.pl.spectratype(
    adata,
    color="cluster",
    viztype="curve",
    curve_layout="shifted",
    fig_kws={"dpi": 120},
    kde_kws={"kde_norm": False},
)

/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/scirpy/_plotting/base.py:262: UserWarning: FixedFormatter should only be used together with FixedLocator
  ax.set_yticklabels(order)

[45]:

<AxesSubplot:title={'center':'Spectratype of IR_VJ_1_junction_aa by cluster'}, xlabel='IR_VJ_1_junction_aa length'>

../_images/tutorials_tutorial_3k_tcr_91_2.png

A spectratype-plot by gene usage. To pre-select specific genes, we can simply filter the adata object before plotting.

[46]:

ir.pl.spectratype(
    adata[
        adata.obs["IR_VDJ_1_v_call"].isin(
            ["TRBV20-1", "TRBV7-2", "TRBV28", "TRBV5-1", "TRBV7-9"]
        ),
        :,
    ],
    cdr3_col="IR_VDJ_1_junction_aa",
    color="IR_VDJ_1_v_call",
    normalize="sample",
    fig_kws={"dpi": 120},
)

Trying to set attribute `.uns` of view, copying.

[46]:

<AxesSubplot:title={'center':'Spectratype of IR_VDJ_1_junction_aa by IR_VDJ_1_v_call'}, xlabel='IR_VDJ_1_junction_aa length', ylabel='Fraction of cells in sample'>

../_images/tutorials_tutorial_3k_tcr_93_2.png

Comparing repertoires¶

Repertoire simlarity and overlaps¶

Overlaps in the adaptive immune receptor repertoire of samples or sample groups enables to pinpoint important clonotype groups, as well as to provide a measure of similarity between samples. Running Scirpy’s repertoire_overlap() tool creates a matrix featuring the abundance of clonotypes in each sample. Additionally, it also computes a (Jaccard) distance matrix of samples as well as the linkage of hierarchical clustering.

[47]:

df, dst, lk = ir.tl.repertoire_overlap(adata, "sample", inplace=False)
df.head()

[47]:

clone_id	0	1	2	3	4	5	6	7	8	9	...	1516	1517	1518	1519	1520	1521	1522	1523	1524	1525
sample
CN1	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
CT1	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0
CT2	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
EN1	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
EN2	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

5 rows × 1526 columns

The distance matrix can be shown as a heatmap, while samples are reordered based on hierarchical clustering.

[48]:

ir.pl.repertoire_overlap(adata, "sample", heatmap_cats=["patient", "source"])

[48]:

<seaborn.matrix.ClusterGrid at 0x7f94380b2250>

../_images/tutorials_tutorial_3k_tcr_98_1.png

A specific pair of samples can be compared on a scatterplot, where dot size corresponds to the number of clonotypes at a given coordinate.

[49]:

ir.pl.repertoire_overlap(
    adata, "sample", pair_to_plot=["LN2", "LT2"], fig_kws={"dpi": 120}
)

No handles with labels found to put in legend.

[49]:

<AxesSubplot:title={'center':'Repertoire overlap between LN2 and LT2'}, xlabel='Clonotype size in LN2', ylabel='Clonotype size in LT2'>

../_images/tutorials_tutorial_3k_tcr_100_2.png

Integrating gene expression data¶

Leveraging the opportunity offered by close integeration with scanpy, transcriptomics-based data can be utilized alongside immune receptor data.

Clonotype modularity¶

Using the Clonotype modularity we can identify clonotypes consisting of cells that are transcriptionally more similar than expected by random.

The clonotype modularity score represents the log2 fold change of the number of edges in the cell-cell neighborhood graph compared to the random background model. Clonotypes (or clonotype clusters) with a high modularity score consist of cells that have a similar molecular phenotype.

[50]:

ir.tl.clonotype_modularity(adata, target_col="cc_aa_alignment")

Initalizing clonotype subgraphs...

100%|██████████| 1487/1487 [00:00<00:00, 6554.96it/s]

Computing background distributions...


100%|██████████| 1000/1000 [00:01<00:00, 766.86it/s]

We can plot the clonotype modularity on top of a umap of clonotype network plot

[51]:

sc.pl.umap(adata, color="clonotype_modularity")

../_images/tutorials_tutorial_3k_tcr_105_0.png

[52]:

ir.pl.clonotype_network(
    adata,
    color="clonotype_modularity",
    label_fontsize=9,
    panel_size=(6, 6),
    base_size=20,
)

[52]:

<AxesSubplot:>

../_images/tutorials_tutorial_3k_tcr_106_1.png

We can also visualize the clonotype modularity together with the associated FDR as a sort of “one sided volcano plot”:

[53]:

ir.pl.clonotype_modularity(adata, base_size=20)

[53]:

<AxesSubplot:xlabel='modularity score', ylabel='-log10(FDR)'>

../_images/tutorials_tutorial_3k_tcr_108_1.png

Let’s further inspect the two top scoring candidates. We can extract that information from adata.obs["clonotype_modularity"].

[54]:

clonotypes_top_modularity = list(
    adata.obs.set_index("cc_aa_alignment")["clonotype_modularity"]
    .sort_values(ascending=False)
    .index.unique().values[:2]
)

[55]:

sc.pl.umap(
    adata,
    color="cc_aa_alignment",
    groups=clonotypes_top_modularity,
    palette=cycler(color=mpl_cm.Dark2_r.colors),
)

../_images/tutorials_tutorial_3k_tcr_111_0.png

We observe that they are (mostly) restricted to a single cluster. By leveraging scanpy’s differential expression module, we can compare the gene expression of the cells in the two clonotypes to the rest.

[56]:

sc.tl.rank_genes_groups(
    adata, "clone_id", groups=clonotypes_top_modularity, reference="rest", method="wilcoxon"
)
fig, axs = plt.subplots(1, 2, figsize=(8, 4))
for ct, ax in zip(clonotypes_top_modularity, axs):
    sc.pl.rank_genes_groups_violin(adata, groups=[ct], n_genes=15, ax=ax, show=False, strip=False)

ranking genes
    finished (0:00:01)

../_images/tutorials_tutorial_3k_tcr_113_1.png

Clonotype imbalance among cell clusters¶

Using cell type annotation inferred from gene expression clusters, for example, clonotypes belonging to CD8+ effector T-cells and CD8+ tissue-resident memory T cells, can be compared.

[57]:

freq, stat = ir.tl.clonotype_imbalance(
    adata,
    replicate_col="sample",
    groupby="cluster",
    case_label="CD8_Teff",
    control_label="CD8_Trm",
    inplace=False,
)
top_differential_clonotypes = stat["clone_id"].tolist()[:3]

WARNING: Clonotype imbalance calculation depends on repertoire overlap. We could not detect any previous runs of repertoire_overlap, so the tool is running now...

/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/scirpy/io/_util.py:67: FutureWarning: clonotype_imbalance is a deprecated function and will be removed in a future version of scirpy. Consider using `tl.clonotype_modularity` instead. If `clonotype_modularity` does not cover your use-case, please create an issue on GitHub to let us know such that we can take it into account! (https://github.com/icbi-lab/scirpy/issues)
  return f(*args, **kwargs)
/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/scirpy/_tools/_clonotype_imbalance.py:280: RuntimeWarning: divide by zero encountered in log2
  logfoldchange = np.log2(
/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/scirpy/_tools/_clonotype_imbalance.py:281: RuntimeWarning: divide by zero encountered in double_scalars
  (case_mean_freq + global_minimum) / (control_mean_freq + global_minimum)

Showing top clonotypes on a UMAP clearly shows that clonotype 101 is featured by CD8+ tissue-resident memory T cells, while clonotype 68 by CD8+ effector and effector memory cells.

[58]:

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4), gridspec_kw={"wspace": 0.6})
sc.pl.umap(adata, color="cluster", ax=ax1, show=False)
sc.pl.umap(
    adata,
    color="clone_id",
    groups=top_differential_clonotypes,
    ax=ax2,
    # increase size of highlighted dots
    size=[
        80 if c in top_differential_clonotypes else 30 for c in adata.obs["clone_id"]
    ],
    palette=cycler(color=mpl_cm.Dark2_r.colors),
)

../_images/tutorials_tutorial_3k_tcr_117_0.png

Repertoire overlap of cell types¶

Just like comparing repertoire overlap among samples, Scirpy also offers comparison between gene expression clusters or cell subpopulations. As an example, repertoire overlap of the two cell types compared above is shown.

[59]:

ir.tl.repertoire_overlap(adata, "cluster")
ir.pl.repertoire_overlap(
    adata, "cluster", pair_to_plot=["CD8_Teff", "CD8_Trm"], fig_kws={"dpi": 120}
)

No handles with labels found to put in legend.

[59]:

<AxesSubplot:title={'center':'Repertoire overlap between CD8_Teff and CD8_Trm'}, xlabel='Clonotype size in CD8_Teff', ylabel='Clonotype size in CD8_Trm'>

../_images/tutorials_tutorial_3k_tcr_119_2.png

Marker genes in top clonotypes¶

Gene expression of cells belonging to individual clonotypes can also be compared using Scanpy. As an example, differential gene expression of two clonotypes, found to be specific to cell type clusters can also be analysed.

[60]:

sc.tl.rank_genes_groups(
    adata, "clone_id", groups=["101"], reference="68", method="wilcoxon"
)
sc.pl.rank_genes_groups_violin(adata, groups="101", n_genes=15)

ranking genes
    finished (0:00:00)

../_images/tutorials_tutorial_3k_tcr_121_1.png