Analysis of 3k T cells from cancer¶

In this tutorial, we re-analyze single-cell TCR/RNA-seq data from Wu et al. ([WMdA+20]) generated on the 10x Genomics platform. The original dataset consists of >140k T cells from 14 treatment-naive patients across four different types of cancer. For this tutorial, to speed up computations, we use a downsampled version of 3k cells.

[1]:

# This cell is for development only. Don't copy this to your notebook.
%load_ext autoreload
%autoreload 2
import sys
import warnings

sys.path.insert(0, "../..")
warnings.filterwarnings("ignore", category=FutureWarning)

[2]:

import numpy as np
import pandas as pd
import scanpy as sc
import scirpy as ir
from matplotlib import pyplot as plt

/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/setuptools_scm/git.py:68: UserWarning: "/home/runner/work/scirpy/scirpy" is shallow and may cause errors
  warnings.warn('"{}" is shallow and may cause errors'.format(wd.path))

[3]:

sc.logging.print_header()

scanpy==1.6.1 anndata==0.7.5 umap==0.4.6 numpy==1.19.5 scipy==1.6.0 pandas==1.2.1 scikit-learn==0.24.1 statsmodels==0.12.1 python-igraph==0.8.3 leidenalg==0.8.3

The dataset ships with the scirpy package. We can conveniently load it from the datasets module:

[4]:

adata = ir.datasets.wu2020_3k()

wu2020_3k.h5ad: 16.7MB [00:00, 74.1MB/s]

adata is a regular AnnData object with additional, IR-specific columns in obs. For more information, check the page about Scirpy’s data structure.

Note

For more information about our T-cell receptor model, see Immune receptor (IR) model.

[5]:

adata.shape

[5]:

(3000, 30727)

[6]:

adata.obs

[6]:

	cluster_orig	patient	sample	source	clonotype_orig	multi_chain	IR_VJ_1_locus	IR_VJ_2_locus	IR_VDJ_1_locus	IR_VDJ_2_locus	...	IR_VJ_1_c_gene	IR_VJ_2_c_gene	IR_VDJ_1_c_gene	IR_VDJ_2_c_gene	IR_VJ_1_cdr3_nt	IR_VJ_2_cdr3_nt	IR_VDJ_1_cdr3_nt	IR_VDJ_2_cdr3_nt	has_ir	batch
LN1_GTAGGCCAGCGTAGTG-1-19	4.4-FOS	Lung1	LN1	NAT	lung1.tn.C223	False	nan	nan	TRB	TRB	...	nan	nan	TRBC2	TRBC2	None	None	TGTGCCAGCAGCTTAATGCGGCTAGCGGGAGATACGCAGTATTTT	TGTGCAAGTCGCTTAGCGGTTTTATCGACTAGCGGGAGTGTCGGAG...	True	19
RN2_AGAGCGACAGATTGCT-1-27	4.4-FOS	Renal2	RN2	NAT	renal2.tnb.C1362	False	TRA	nan	TRB	nan	...	TRAC	nan	TRBC1	nan	TGTGCTGTGAGGGGGAATAACAATGCCAGACTCATGTTT	None	TGTGCCAGCAGCTTTGGAACGGTGGCTGAAGCTTTCTTT	None	True	27
LN1_GTCATTTCAATGAAAC-1-19	8.2-Tem	Lung1	LN1	NAT	lung1.tn.C25	False	TRA	nan	TRB	nan	...	TRAC	nan	TRBC1	nan	TGTGCTGTGAGGTTGGGTAACCAGTTCTATTTT	None	TGCAGTGCTAGAGATGGAGGGGGGGGGAACACTGAAGCTTTCTTT	None	True	19
LN2_GACACGCAGGTAGCTG-2-2	8.6-KLRB1	Lung2	LN2	NAT	lung2.tn.C2452	False	nan	nan	TRB	nan	...	nan	nan	TRBC2	nan	None	None	TGTGCCAGCAGCCAAGGTCAGGGACAGGATTTTAACTACGAGCAGT...	None	True	2
LN2_GCACTCTCAGGGATTG-2-2	4.4-FOS	Lung2	LN2	NAT	lung2.tn.C5631	False	TRA	nan	TRB	nan	...	TRAC	nan	TRBC1	nan	TGTGCAGCAAGCGACCCCACGGTCGAGGCAGGAACTGCTCTGATCTTT	None	TGTGCCAGCAGCTTGACCGTTAACACTGAAGCTTTCTTT	None	True	2
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
RT3_GCAGTTAGTATGAAAC-1-6	4.2-RPL32	Renal3	RT3	Tumor	renal3.tnb.C176	False	TRA	nan	TRB	nan	...	TRAC	nan	TRBC2	nan	TGTGCTGCGATGGATAGCAACTATCAGTTAATCTGG	None	TGTGCCACCAAGGACAGGGAAGACACCGGGGAGCTGTTTTTT	None	True	6
LT1_GACGTGCTCTCAAGTG-1-24	8.2-Tem	Lung1	LT1	Tumor	lung1.tn.C151	False	TRA	nan	nan	nan	...	TRAC	nan	nan	nan	TGTGCTTATAGGAGTTCCCTTGGTGGTGCTACAAACAAGCTCATCTTT	None	None	None	True	24
ET3_GCTGGGTAGACCTTTG-1-3	3.1-MT	Endo3	ET3	Tumor	endo3.tn.C76	False	nan	nan	TRB	nan	...	nan	nan	TRBC2	nan	None	None	TGTGCCAGCAGCCGGACAGGGGGGGATTCCGGGGAGCTGTTTTTT	None	True	3
RT1_TAAGAGATCCTTAATC-1-8	4.5-IL6ST	Renal1	RT1	Tumor	renal1.tnb.C83	False	TRA	nan	TRB	nan	...	TRAC	nan	TRBC2	nan	TGTGCAATGAGCGAGATTTCTGGTGGCTACAATAAGCTGATTTTT	None	TGTGCCTGGAGTGACAGGTCAGACGAGCAGTACTTC	None	True	8
LN2_TCTGAGAAGGACACCA-2-2	4.6b-Treg	Lung2	LN2	NAT	lung2.tn.C6211	False	TRA	nan	TRB	nan	...	TRAC	nan	TRBC2	nan	TGTGCTACGGAGGAATATGGAAACAAGCTGGTCTTT	None	TGTGCCAGCAGCTTTCTACCTTTGGGAGATACGCAGTATTTT	None	True	2

3000 rows × 44 columns

Note

Importing data

scirpy natively supports reading IR data from Cellranger (10x), TraCeR (Smart-seq2) or the AIRR rearrangement schema and provides helper functions to import other data types. We provide a dedicated tutorial on data loading with more details.

This particular dataset has been imported using scirpy.io.read_10x_vdj() and merged with transcriptomics data using scirpy.pp.merge_with_ir(). The exact procedure is described in scirpy.datasets.wu2020().

Preprocess Transcriptomics data¶

Transcriptomics data needs to be filtered and preprocessed as with any other single-cell dataset. We recommend following the scanpy tutorial and the best practice paper by Luecken et al.

[7]:

sc.pp.filter_genes(adata, min_cells=10)
sc.pp.filter_cells(adata, min_genes=100)

[8]:

sc.pp.normalize_per_cell(adata, counts_per_cell_after=1000)
sc.pp.log1p(adata)

For the Wu2020 dataset, the authors already provide clusters and UMAP coordinates. Instead of performing clustering and cluster annotation ourselves, we will just use the provided data. The clustering and annotation methodology is described in their paper.

[9]:

adata.obsm["X_umap"] = adata.obsm["X_umap_orig"]

[10]:

mapping = {
    "nan": "other",
    "3.1-MT": "other",
    "4.1-Trm": "CD4_Trm",
    "4.2-RPL32": "CD4_RPL32",
    "4.3-TCF7": "CD4_TCF7",
    "4.4-FOS": "CD4_FOSS",
    "4.5-IL6ST": "CD4_IL6ST",
    "4.6a-Treg": "CD4_Treg",
    "4.6b-Treg": "CD4_Treg",
    "8.1-Teff": "CD8_Teff",
    "8.2-Tem": "CD8_Tem",
    "8.3a-Trm": "CD8_Trm",
    "8.3b-Trm": "CD8_Trm",
    "8.3c-Trm": "CD8_Trm",
    "8.4-Chrom": "other",
    "8.5-Mitosis": "other",
    "8.6-KLRB1": "other",
}
adata.obs["cluster"] = [mapping[x] for x in adata.obs["cluster_orig"]]

Let’s inspect the UMAP plots. The first three panels show the UMAP plot colored by sample, patient and cluster. We don’t observe any clustering of samples or patients that could hint at batch effects.

The last three panels show the UMAP colored by the T cell markers CD8, CD4, and FOXP3. We can confirm that the markers correspond to their respective cluster labels.

[11]:

sc.pl.umap(
    adata,
    color=["sample", "patient", "cluster", "CD8A", "CD4", "FOXP3"],
    ncols=2,
    wspace=0.5,
)

... storing 'cluster' as categorical

../_images/tutorials_tutorial_3k_tcr_18_1.svg

TCR Quality Control¶

While most of T cell receptors have exactly one pair of α and β chains, up to one third of T cells can have dual TCRs, i.e. two pairs of receptors originating from different alleles ([SB19]).

Using the scirpy.tl.chain_qc() function, we can add a summary about the T cell receptor compositions to adata.obs. We can visualize it using scirpy.pl.group_abundance().

Note

chain pairing

Orphan chain refers to cells that have either a single alpha or beta receptor chain.
Extra chain refers to cells that have a full alpha/beta receptor pair, and an additional chain.
Multichain refers to cells with more than two receptor pairs detected. These cells are likely doublets.

Note

receptor type and receptor subtype

receptor_type refers to a coarse classification into BCR and TCR. Cells both BCR and TCR chains are labelled as ambiguous.
receptor_subtype refers to a more specific classification into α/β, ɣ/δ, IG-λ, and IG-κ chain configurations.

For more details, see scirpy.tl.chain_qc().

[12]:

ir.tl.chain_qc(adata)

As expected, the dataset contains only α/β T-cell receptors:

[13]:

ir.pl.group_abundance(adata, groupby="receptor_subtype", target_col="source")

... storing 'receptor_type' as categorical
... storing 'receptor_subtype' as categorical
... storing 'chain_pairing' as categorical

[13]:

<AxesSubplot:title={'center':'Number of cells in receptor_subtype by source'}, xlabel='receptor_subtype', ylabel='Number of cells'>

../_images/tutorials_tutorial_3k_tcr_23_2.svg

[14]:

ir.pl.group_abundance(adata, groupby="chain_pairing", target_col="source")

[14]:

<AxesSubplot:title={'center':'Number of cells in chain_pairing by source'}, xlabel='chain_pairing', ylabel='Number of cells'>

../_images/tutorials_tutorial_3k_tcr_24_1.svg

Indeed, in this dataset, ~6% of cells have more than one pair of productive T-cell receptors:

[15]:

print(
    "Fraction of cells with more than one pair of TCRs: {:.2f}".format(
        np.sum(
            adata.obs["chain_pairing"].isin(
                ["extra VJ", "extra VDJ", "two full chains"]
            )
        )
        / adata.n_obs
    )
)

Fraction of cells with more than one pair of TCRs: 0.06

Next, we visualize the Multichain cells on the UMAP plot and exclude them from downstream analysis:

[16]:

sc.pl.umap(adata, color="multi_chain")

../_images/tutorials_tutorial_3k_tcr_28_0.svg

[17]:

adata = adata[adata.obs["multi_chain"] != "True", :].copy()

Define clonotypes and clonotype clusters¶

In this section, we will define and visualize clonotypes and clonotype clusters.

Scirpy implements a network-based approach for clonotype definition. The steps to create and visualize the clonotype-network are analogous to the construction of a neighborhood graph from transcriptomics data with scanpy.

Analysis steps on transcriptomics data¶
scanpy function	objective
`scanpy.pp.neighbors()`	Compute a nearest-neighbor graph based on gene expression.
`scanpy.tl.leiden()`	Cluster cells by the similarity of their transcriptional profiles.
`scanpy.tl.umap()`	Compute positions of cells in UMAP embedding.
`scanpy.pl.umap()`	Plot UMAP colored by different parameters.

Analysis steps on IR data¶
scirpy function	objective
`scirpy.pp.ir_neighbors()`	Compute a neighborhood graph of CDR3-sequences.
`scirpy.tl.define_clonotypes()`	Define clonotypes by nucleotide sequence identity.
`scirpy.tl.define_clonotype_clusters()`	Cluster cells by the similarity of their CDR3-sequences
`scirpy.tl.clonotype_network()`	Compute positions of cells in clonotype network.
`scirpy.pl.clonotype_network()`	Plot clonotype network colored by different parameters.

Compute CDR3 neighborhood graph and define clonotypes¶

scirpy.pp.ir_neighbors() computes a neighborhood graph based on CDR3 nucleotide (nt) or amino acid (aa) sequences, either based on sequence identity or similarity.

Here, we define clonotypes based on nt-sequence identity. In a later step, we will define clonotype clusters based on amino-acid similarity.

The function scirpy.tl.define_clonotypes() will detect connected modules in the graph and annotate them as clonotypes. This will add a clonotype and clonotype_size column to adata.obs.

[18]:

# using default parameters, `ir_neighbors` will compute nucleotide sequence identity
ir.pp.ir_neighbors(adata, receptor_arms="all", dual_ir="primary_only")
ir.tl.define_clonotypes(adata)

100%|██████████| 1643/1643 [00:00<00:00, 91653.47it/s]
100%|██████████| 2186/2186 [00:00<00:00, 14977.14it/s]
100%|██████████| 8887/8887 [00:00<00:00, 92481.09it/s]

To visualize the network we first call scirpy.tl.clonotype_network() to compute the layout. We can then visualize it using scirpy.pl.clonotype_network(). We recommend setting the min_size parameter to >=2, to prevent the singleton clonotypes from cluttering the network.

[19]:

ir.tl.clonotype_network(adata, min_size=2)
ir.pl.clonotype_network(adata, color="clonotype", legend_loc="none")

... storing 'clonotype' as categorical

[19]:

array([<AxesSubplot:title={'center':'clonotype'}, xlabel='clonotype_network1', ylabel='clonotype_network2'>],
      dtype=object)

../_images/tutorials_tutorial_3k_tcr_36_2.svg

Re-compute CDR3 neighborhood graph and define clonotype clusters¶

We can now re-compute the neighborhood graph based on amino-acid sequence similarity and define clonotype clusters.

To this end, we need to change set metric="alignment" and specify a cutoff parameter. The distance is based on the BLOSUM62 matrix. For instance, a distance of 10 is equivalent to 2 R`s mutating into `N. This appoach was initially proposed as TCRdist by Dash et al. ([DFGH+17]).

All cells with a distance between their CDR3 sequences lower than cutoff will be connected in the network.

[20]:

sc.settings.verbosity = 4

[21]:

ir.pp.ir_neighbors(
    adata,
    metric="alignment",
    sequence="aa",
    cutoff=15,
    receptor_arms="all",
    dual_ir="all",
)
ir.tl.define_clonotype_clusters(
    adata, partitions="connected", sequence="aa", metric="alignment", within_group=None
)

Initializing IrNeighbors object...
Finished initalizing IrNeighbors object.  (0:00:00)
Computing VJ pairwise distances...

100%|██████████| 595/595 [00:18<00:00, 32.44it/s]

Finished computing VJ pairwise distances. (0:00:18)
Computing VDJ pairwise distances...

100%|██████████| 1081/1081 [00:32<00:00, 32.95it/s]

Finished computing VDJ pairwise distances. (0:00:32)
Started comstructing VJ coord-dictionary...


100%|██████████| 3748/3748 [00:00<00:00, 45428.61it/s]

Finished constructing VJ coord-dictionary (0:00:00)
Started comstructing VDJ coord-dictionary...


100%|██████████| 3498/3498 [00:00<00:00, 15746.96it/s]

Finished constructing VDJ coord-dictionary (0:00:00)
Constructing cell x cell distance matrix...


100%|██████████| 22017/22017 [00:00<00:00, 128578.34it/s]

Finished constructing cell x cell distance matrix.  (0:00:00)
    Started converting distances to connectivities.
    Finished converting distances to connectivities.  (0:00:00)

[22]:

ir.tl.clonotype_network(adata, min_size=4, sequence="aa", metric="alignment")

Compared to the previous plot, we observe slightly larger clusters that are not necessarily fully connected any more.

[23]:

ir.pl.clonotype_network(
    adata,
    color="ct_cluster_aa_alignment",
    legend_fontoutline=3,
    size=80,
    panel_size=(6, 6),
    legend_loc="on data",
)

... storing 'ct_cluster_aa_alignment' as categorical

[23]:

array([<AxesSubplot:title={'center':'ct_cluster_aa_alignment'}, xlabel='clonotype_network1', ylabel='clonotype_network2'>],
      dtype=object)

../_images/tutorials_tutorial_3k_tcr_43_2.svg

Now we show the same graph, colored by patient. We observe that for instance clonotypes 104 and 160 (center-left) are private, i.e. they contain cells from a single sample only. On the other hand, for instance clonotype 233 (top-left) is public, i.e. it is shared across patients Lung5 and Lung1 and Lung3.

[24]:

ir.pl.clonotype_network(adata, color="patient", size=80, panel_size=(6, 6))

[24]:

array([<AxesSubplot:title={'center':'patient'}, xlabel='clonotype_network1', ylabel='clonotype_network2'>],
      dtype=object)

../_images/tutorials_tutorial_3k_tcr_45_1.svg

We can now extract information (e.g. CDR3-sequences) from a specific clonotype cluster by subsetting AnnData. For instance, we can find out that clonotype 233 does not have any detected :term:VJ-chain<V(D)J>. In this context, the VJ-chain refers to a TCR-alpha chain.

[25]:

adata.obs.loc[
    adata.obs["ct_cluster_aa_alignment"] == "233",
    [
        "IR_VJ_1_cdr3",
        "IR_VJ_2_cdr3",
        "IR_VDJ_1_cdr3",
        "IR_VDJ_2_cdr3",
        "receptor_subtype",
    ],
]

[25]:

	IR_VJ_1_cdr3	IR_VJ_2_cdr3	IR_VDJ_1_cdr3	IR_VDJ_2_cdr3	receptor_subtype
LN5_TGGACGCAGCCCAACC-1-7	None	None	CASSFRGTGELFF	None	TRA+TRB
LN1_CAGAATCGTCGCATCG-1-19	None	None	CASSYQGATEAFF	None	TRA+TRB
LN1_TCAATCTGTGTGAAAT-1-19	None	None	CASSYQGATEAFF	None	TRA+TRB
LT3_ACTTGTTTCGCATGAT-1-15	None	None	CASSYQGAGELFF	None	TRA+TRB
LN3_TTAGGACGTAAGAGAG-1-11	None	None	CASSYQGSTEAFF	None	TRA+TRB
LN5_AGACGTTAGTGAATTG-1-7	None	None	CASSFRGTGELFF	None	TRA+TRB
LN5_GTAGTCAGTCCGAAGA-1-7	None	None	CASSFRGTGELFF	None	TRA+TRB
LN5_ACTGAGTAGTGCGATG-1-7	None	None	CASSFRGTGELFF	None	TRA+TRB
LN3_GTCTCGTTCGTACGGC-1-11	None	None	CASSYQGSTEAFF	None	TRA+TRB
LN5_TTGCCGTTCACATGCA-1-7	None	None	CASSFRGTGELFF	None	TRA+TRB
LN1_TGCGTGGCAGCAGTTT-1-19	None	None	CASSYQGATEAFF	None	TRA+TRB
LN5_TGCGCAGTCTACTCAT-1-7	None	None	CASSFRGTGELFF	None	TRA+TRB
LN5_CGATGTAGTCATATGC-1-7	None	None	CASSFRGTGELFF	None	TRA+TRB
LN1_GCCTCTATCACTTATC-1-19	None	None	CASSYQGATEAFF	None	TRA+TRB
LN5_TCAGGTACACATCCAA-1-7	None	None	CASSFRGTGELFF	None	TRA+TRB
LN1_GGACATTTCGTAGGTT-1-19	None	None	CASSYQGATEAFF	None	TRA+TRB
LN5_CGGAGTCCACCGGAAA-1-7	None	None	CASSFRGTGELFF	None	TRA+TRB

Including the V-gene in clonotype definition¶

Using the paramter use_v_gene in define_clonotypes(), we can enforce clonotypes (or clonotype clusters) to have the same V-gene, and, therefore, the same CDR1 and 2 regions. Let’s look for clonotype clusters with different V genes:

[26]:

ir.tl.define_clonotype_clusters(
    adata,
    sequence="aa",
    metric="alignment",
    same_v_gene="primary_only",
    key_added="ct_cluster_aa_alignment_same_v",
)

[27]:

# find clonotypes with more than one `clonotype_same_v`
ct_different_v = adata.obs.groupby("ct_cluster_aa_alignment").apply(
    lambda x: x["ct_cluster_aa_alignment_same_v"].unique().size > 1
)
ct_different_v = ct_different_v[ct_different_v].index.values
ct_different_v

[27]:

['6', '12', '13', '90', '95', ..., '1507', '1518', '1814', '1857', '1860']
Length: 57
Categories (2396, object): ['0', '1', '2', '3', ..., '2392', '2393', '2394', '2395']

[28]:

# Display the first 2 clonotypes with different v genes
adata.obs.loc[
    adata.obs["ct_cluster_aa_alignment"].isin(ct_different_v[:2]),
    [
        "ct_cluster_aa_alignment",
        "ct_cluster_aa_alignment_same_v",
        "IR_VJ_1_v_gene",
        "IR_VDJ_1_v_gene",
    ],
].sort_values("ct_cluster_aa_alignment").drop_duplicates().reset_index(drop=True)

[28]:

	ct_cluster_aa_alignment	ct_cluster_aa_alignment_same_v	IR_VJ_1_v_gene	IR_VDJ_1_v_gene
0	6	6_nan_TRBV4-2_TCR	nan	TRBV4-2
1	6	6_nan_TRBV5-4_TCR	nan	TRBV5-4
2	6	6_nan_TRBV5-8_TCR	nan	TRBV5-8
3	6	6_nan_TRBV4-1_TCR	nan	TRBV4-1
4	6	6_nan_TRBV12-4_TCR	nan	TRBV12-4
5	6	6_nan_TRBV5-1_TCR	nan	TRBV5-1
6	6	6_nan_TRBV7-9_TCR	nan	TRBV7-9
7	6	6_nan_TRBV11-2_TCR	nan	TRBV11-2
8	6	6_nan_TRBV10-2_TCR	nan	TRBV10-2
9	12	12_nan_TRBV6-5_TCR	nan	TRBV6-5
10	12	12_nan_TRBV9_TCR	nan	TRBV9

Clonotype analysis¶

Clonal expansion¶

Let’s visualize the number of expanded clonotypes (i.e. clonotypes consisting of more than one cell) by cell-type. The first option is to add a column with the scirpy.tl.clonal_expansion() to adata.obs and overlay it on the UMAP plot.

[29]:

ir.tl.clonal_expansion(adata)

clonal_expansion refers to expansion categories, i.e singleton clonotypes, clonotypes with 2 cells and more than 2 cells. The clonotype_size refers to the absolute number of cells in a clonotype.

[30]:

sc.pl.umap(adata, color=["clonal_expansion", "clonotype_size"])

... storing 'ct_cluster_aa_alignment_same_v' as categorical
... storing 'clonal_expansion' as categorical

../_images/tutorials_tutorial_3k_tcr_57_1.svg

The second option is to show the number of cells belonging to an expanded clonotype per category in a stacked bar plot, using the scirpy.pl.clonal_expansion() plotting function.

[31]:

ir.pl.clonal_expansion(adata, groupby="cluster", clip_at=4, normalize=False)

[31]:

<AxesSubplot:>

../_images/tutorials_tutorial_3k_tcr_59_1.svg

The same plot, normalized to cluster size. Clonal expansion is a sign of positive selection for a certain, reactive T-cell clone. It, therefore, makes sense that CD8+ effector T-cells have the largest fraction of expanded clonotypes.

[32]:

ir.pl.clonal_expansion(adata, "cluster")

[32]:

<AxesSubplot:>

../_images/tutorials_tutorial_3k_tcr_61_1.svg

Expectedly, the CD8+ effector T cells have the largest fraction of expanded clonotypes.

Consistent with this observation, they have the lowest scirpy.pl.alpha_diversity() of clonotypes.

[33]:

ax = ir.pl.alpha_diversity(adata, groupby="cluster")

../_images/tutorials_tutorial_3k_tcr_63_0.svg

Clonotype abundance¶

The function scirpy.pl.group_abundance() allows us to create bar charts for arbitrary categorial from obs. Here, we use it to show the distribution of the ten largest clonotypes across the cell-type clusters.

[34]:

ir.pl.group_abundance(adata, groupby="clonotype", target_col="cluster", max_cols=10)

[34]:

<AxesSubplot:title={'center':'Number of cells in clonotype by cluster'}, xlabel='clonotype', ylabel='Number of cells'>

../_images/tutorials_tutorial_3k_tcr_66_1.svg

It might be beneficial to normalize the counts to the number of cells per sample to mitigate biases due to different sample sizes:

[35]:

ir.pl.group_abundance(
    adata, groupby="clonotype", target_col="cluster", max_cols=10, normalize="sample"
)

[35]:

<AxesSubplot:title={'center':'Fraction of cluster in each clonotype'}, xlabel='clonotype', ylabel='Fraction of cells in sample'>

../_images/tutorials_tutorial_3k_tcr_68_1.svg

Coloring the bars by patient gives us information about public and private clonotypes: Some clonotypes are private, i.e. specific to a certain tissue, others are public, i.e. they are shared across different tissues.

[36]:

ax = ir.pl.group_abundance(
    adata, groupby="clonotype", target_col="source", max_cols=15, fig_kws={"dpi": 100}
)

../_images/tutorials_tutorial_3k_tcr_70_0.svg

However, clonotypes that are shared between patients are rare:

[37]:

ax = ir.pl.group_abundance(
    adata, groupby="clonotype", target_col="patient", max_cols=15, fig_kws={"dpi": 100}
)

../_images/tutorials_tutorial_3k_tcr_72_0.svg

Gene usage¶

scirpy.tl.group_abundance() can also give us some information on VDJ usage. We can choose any of the {TRA,TRB}_{1,2}_{v,d,j,c}_gene columns to make a stacked bar plot. We use max_col to limit the plot to the 10 most abundant V-genes.

[38]:

ir.pl.group_abundance(
    adata, groupby="IR_VJ_1_v_gene", target_col="cluster", normalize=True, max_cols=10
)

[38]:

<AxesSubplot:title={'center':'Fraction of cluster in each IR_VJ_1_v_gene'}, xlabel='IR_VJ_1_v_gene', ylabel='Fraction of cells in cluster'>

../_images/tutorials_tutorial_3k_tcr_75_1.svg

We can pre-select groups by filtering adata:

[39]:

ir.pl.group_abundance(
    adata[
        adata.obs["IR_VDJ_1_v_gene"].isin(
            ["TRBV20-1", "TRBV7-2", "TRBV28", "TRBV5-1", "TRBV7-9"]
        ),
        :,
    ],
    groupby="cluster",
    target_col="IR_VDJ_1_v_gene",
    normalize=True,
)

Trying to set attribute `.uns` of view, copying.

[39]:

<AxesSubplot:title={'center':'Fraction of IR_VDJ_1_v_gene in each cluster'}, xlabel='cluster', ylabel='Fraction of cells in IR_VDJ_1_v_gene'>

../_images/tutorials_tutorial_3k_tcr_77_2.svg

The exact combinations of VDJ genes can be visualized as a Sankey-plot using scirpy.pl.vdj_usage().

[40]:

ir.pl.vdj_usage(adata, full_combination=False, max_segments=None, max_ribbons=30)

[40]:

<AxesSubplot:>

../_images/tutorials_tutorial_3k_tcr_79_1.svg

We can also use this plot to investigate the exact VDJ composition of one (or several) clonotypes:

[41]:

ir.pl.vdj_usage(
    adata[adata.obs["clonotype"].isin(["274_TCR", "277_TCR", "211_TCR", "106_TCR"]), :],
    max_ribbons=None,
    max_segments=100,
)

[41]:

<AxesSubplot:>

../_images/tutorials_tutorial_3k_tcr_81_1.svg

Spectratype plots¶

spectratype() plots give us information about the length distribution of CDR3 regions.

[42]:

ir.pl.spectratype(adata, color="cluster", viztype="bar", fig_kws={"dpi": 120})

[42]:

<AxesSubplot:title={'center':'Spectratype of IR_VJ_1_cdr3 by cluster'}, xlabel='IR_VJ_1_cdr3 length', ylabel='Number of cells'>

../_images/tutorials_tutorial_3k_tcr_84_1.svg

The same chart visualized as “ridge”-plot:

[43]:

ir.pl.spectratype(
    adata,
    color="cluster",
    viztype="curve",
    curve_layout="shifted",
    fig_kws={"dpi": 120},
    kde_kws={"kde_norm": False},
)

../../scirpy/_plotting/base.py:256: UserWarning: FixedFormatter should only be used together with FixedLocator
  ax.set_yticklabels(order)

[43]:

<AxesSubplot:title={'center':'Spectratype of IR_VJ_1_cdr3 by cluster'}, xlabel='IR_VJ_1_cdr3 length'>

../_images/tutorials_tutorial_3k_tcr_86_2.svg

A spectratype-plot by gene usage. To pre-select specific genes, we can simply filter the adata object before plotting.

[44]:

ir.pl.spectratype(
    adata[
        adata.obs["IR_VDJ_1_v_gene"].isin(
            ["TRBV20-1", "TRBV7-2", "TRBV28", "TRBV5-1", "TRBV7-9"]
        ),
        :,
    ],
    cdr3_col="IR_VDJ_1_cdr3",
    color="IR_VDJ_1_v_gene",
    normalize="sample",
    fig_kws={"dpi": 120},
)

Trying to set attribute `.uns` of view, copying.

[44]:

<AxesSubplot:title={'center':'Spectratype of IR_VDJ_1_cdr3 by IR_VDJ_1_v_gene'}, xlabel='IR_VDJ_1_cdr3 length', ylabel='Fraction of cells in sample'>

../_images/tutorials_tutorial_3k_tcr_88_2.svg

Comparing repertoires¶

Repertoire simlarity and overlaps¶

Overlaps in the adaptive immune receptor repertoire of samples or sample groups enables to pinpoint important clonotype groups, as well as to provide a measure of similarity between samples. Running Scirpy’s repertoire_overlap() tool creates a matrix featuring the abundance of clonotypes in each sample. Additionally, it also computes a (Jaccard) distance matrix of samples as well as the linkage of hierarchical clustering.

[45]:

df, dst, lk = ir.tl.repertoire_overlap(adata, "sample", inplace=False)
df.head()

[45]:

clonotype	0_TCR	1_TCR	2_TCR	3_TCR	4_TCR	5_TCR	6_TCR	7_TCR	8_TCR	9_TCR	...	2505_TCR	2506_TCR	2507_TCR	2508_TCR	2509_TCR	2510_TCR	2511_TCR	2512_TCR	2513_TCR	2514_TCR
sample
CN1	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0
CN2	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
CT1	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
CT2	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
EN1	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

5 rows × 2515 columns

The distance matrix can be shown as a heatmap, while samples are reordered based on hierarchical clustering.

[46]:

ir.pl.repertoire_overlap(adata, "sample", heatmap_cats=["patient", "source"])

[46]:

<seaborn.matrix.ClusterGrid at 0x7fb76cbf3650>

../_images/tutorials_tutorial_3k_tcr_93_1.svg

A specific pair of samples can be compared on a scatterplot, where dot size corresponds to the number of clonotypes at a given coordinate.

[47]:

ir.pl.repertoire_overlap(
    adata, "sample", pair_to_plot=["LN2", "LT2"], fig_kws={"dpi": 120}
)

No handles with labels found to put in legend.

[47]:

<AxesSubplot:title={'center':'Repertoire overlap between LN2 and LT2'}, xlabel='Clonotype size in LN2', ylabel='Clonotype size in LT2'>

../_images/tutorials_tutorial_3k_tcr_95_2.svg

Clonotypes preferentially occuring in a group¶

Clonotypes associated with an experimental group (a given cell type, samle or diagnosis) might be important candidates as biomarkers or disease drivers. Scirpy offers clonotype_imbalance() to rank clonotypes based on Fisher’s exact test comparing the fractional presence of a given clonotype in two groups.

A possible grouping criterion could be Tumor vs. Control, separately for distinct tumor types. The site of the tumor can be extracted from patient metadata.

[48]:

adata.obs["site"] = adata.obs["patient"].str.slice(stop=-1)

[49]:

ir.pl.clonotype_imbalance(
    adata,
    replicate_col="sample",
    groupby="source",
    case_label="Tumor",
    additional_hue="site",
    plot_type="strip",
)

WARNING: Clonotype imbalance not found. Running `ir.tl.clonotype_imbalance` and storing under {key_added}
WARNING: Clonotype imbalance calculation depends on repertoire overlap. We could not detect any previous runs of repertoire_overlap, so the tool is running now...

../../scirpy/_tools/_clonotype_imbalance.py:273: RuntimeWarning: divide by zero encountered in log2
  (case_mean_freq + global_minimum) / (control_mean_freq + global_minimum)
../../scirpy/_tools/_clonotype_imbalance.py:273: RuntimeWarning: divide by zero encountered in double_scalars
  (case_mean_freq + global_minimum) / (control_mean_freq + global_minimum)
../../scirpy/_tools/_clonotype_imbalance.py:273: RuntimeWarning: divide by zero encountered in log2
  (case_mean_freq + global_minimum) / (control_mean_freq + global_minimum)
../../scirpy/_tools/_clonotype_imbalance.py:273: RuntimeWarning: divide by zero encountered in double_scalars
  (case_mean_freq + global_minimum) / (control_mean_freq + global_minimum)
../../scirpy/_tools/_clonotype_imbalance.py:273: RuntimeWarning: divide by zero encountered in log2
  (case_mean_freq + global_minimum) / (control_mean_freq + global_minimum)
../../scirpy/_tools/_clonotype_imbalance.py:273: RuntimeWarning: divide by zero encountered in double_scalars
  (case_mean_freq + global_minimum) / (control_mean_freq + global_minimum)
../../scirpy/_tools/_clonotype_imbalance.py:273: RuntimeWarning: divide by zero encountered in double_scalars
  (case_mean_freq + global_minimum) / (control_mean_freq + global_minimum)
../../scirpy/_tools/_clonotype_imbalance.py:273: RuntimeWarning: divide by zero encountered in log2
  (case_mean_freq + global_minimum) / (control_mean_freq + global_minimum)

[49]:

<seaborn.axisgrid.FacetGrid at 0x7fb76d0b6810>

../_images/tutorials_tutorial_3k_tcr_100_3.svg

To get an idea how the above, top-ranked clonotypes compare to the bulk of all clonotypes, a Volcano plot is genereated, showing the -log10 p-value of the Fisher’s test as a function of log2(fold-change) of the normalized proportion of a given clonotype in the test group compared to the control group. To avoid zero division, 0.01*(global minimum proportion) was added to every normalized clonotype proportions.

[50]:

ir.pl.clonotype_imbalance(
    adata,
    replicate_col="sample",
    groupby="source",
    case_label="Tumor",
    additional_hue="diagnosis",
    plot_type="volcano",
    fig_kws={"dpi": 120},
)

No handles with labels found to put in legend.

[50]:

<AxesSubplot:title={'center':'Volcano plot'}, xlabel='log2FoldChange', ylabel='-log10(p-value)'>

../_images/tutorials_tutorial_3k_tcr_102_2.svg

Integrating gene expression¶

Clonotype imbalance among cell clusters¶

Leveraging the opportunity offered by close integeration with scanpy, transcriptomics-based data can be utilized directly. Using cell type annotation inferred from gene expression clusters, for example, clonotypes belonging to CD8+ effector T-cells and CD8+ tissue-resident memory T cells, can be compared.

[51]:

freq, stat = ir.tl.clonotype_imbalance(
    adata,
    replicate_col="sample",
    groupby="cluster",
    case_label="CD8_Teff",
    control_label="CD8_Trm",
    inplace=False,
)
top_differential_clonotypes = stat["clonotype"].tolist()[:5]

WARNING: Clonotype imbalance calculation depends on repertoire overlap. We could not detect any previous runs of repertoire_overlap, so the tool is running now...

../../scirpy/_tools/_clonotype_imbalance.py:273: RuntimeWarning: divide by zero encountered in log2
  (case_mean_freq + global_minimum) / (control_mean_freq + global_minimum)

Showing top clonotypes on a UMAP clearly shows that clonotype 163 is featured by CD8+ tissue-resident memory T cells, while clonotype 277 by CD8+ effector T-cells.

[52]:

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4), gridspec_kw={"wspace": 0.6})
sc.pl.umap(adata, color="cluster", ax=ax1, show=False)
sc.pl.umap(
    adata,
    color="clonotype",
    groups=top_differential_clonotypes,
    ax=ax2,
    # increase size of highlighted dots
    size=[
        80 if c in top_differential_clonotypes else 30 for c in adata.obs["clonotype"]
    ],
)

... storing 'site' as categorical

../_images/tutorials_tutorial_3k_tcr_106_1.svg

Repertoire overlap of cell types¶

Just like comparing repertoire overlap among samples, Scirpy also offers comparison between gene expression clusters or cell subpopulations. As an example, repertoire overlap of the two cell types compared above is shown.

[53]:

ir.tl.repertoire_overlap(adata, "cluster")
ir.pl.repertoire_overlap(
    adata, "cluster", pair_to_plot=["CD8_Teff", "CD8_Trm"], fig_kws={"dpi": 120}
)

No handles with labels found to put in legend.

[53]:

<AxesSubplot:title={'center':'Repertoire overlap between CD8_Teff and CD8_Trm'}, xlabel='Clonotype size in CD8_Teff', ylabel='Clonotype size in CD8_Trm'>

../_images/tutorials_tutorial_3k_tcr_108_2.svg

Marker genes in top clonotypes¶

Gene expression of cells belonging to individual clonotypes can also be compared using Scanpy. As an example, differential gene expression of two clonotypes, found to be specific to cell type clusters can also be analysed.

[54]:

sc.tl.rank_genes_groups(
    adata, "clonotype", groups=["163_TCR"], reference="277_TCR", method="wilcoxon"
)
sc.pl.rank_genes_groups_violin(adata, groups="163_TCR", n_genes=15)

ranking genes
    consider 'clonotype' groups:
    with sizes: [19 12]
--> Few observations in a group for normal approximation (<=25). Lower test accuracy.
    finished: added to `.uns['rank_genes_groups']`
    'names', sorted np.recarray to be indexed by group ids
    'scores', sorted np.recarray to be indexed by group ids
    'logfoldchanges', sorted np.recarray to be indexed by group ids
    'pvals', sorted np.recarray to be indexed by group ids
    'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:00)

../_images/tutorials_tutorial_3k_tcr_110_1.svg