[1]:
# This cell is for development only. Don't copy this to your notebook.
%load_ext autoreload
%autoreload 2
import anndata

anndata.logging.anndata_logger.addFilter(
    lambda r: not r.getMessage().startswith("storing")
    and r.getMessage().endswith("as categorical.")
)

# Temporarily suppress FutureWarnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
[2]:
import scirpy as ir
import scanpy as sc
from glob import glob
import pandas as pd
import tarfile
import anndata
import warnings

sc.set_figure_params(figsize=(4, 4))
sc.settings.verbosity = 2  # verbosity: errors (0), warnings (1), info (2), hints (3)
C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\tqdm\auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

Loading adaptive Immune Receptor (IR)-sequencing data with Scirpy

In this notebook, we demonstrate how single-cell IR-data can be imported into an AnnData object for the use with Scirpy. To learn more about AnnData and how Scirpy makes use of it, check out the data structure section.

The example data used in this notebook are available from the Scirpy repository.

Important

The Scirpy data model

Currently, the Scirpy data model has the following constraints:

  • BCR and TCR chains are supported. Chain loci must be valid Chain locus, i.e. one of TRA, TRG, IGK, or IGL (chains with a VJ junction) or TRB, TRD, or IGH (chains with a VDJ junction).

  • Each cell can contain up to two VJ and two VDJ chains (Dual IR). Excess chains are ignored (those with lowest read count/UMI count) and cells flagged as Multichain-cell.

  • Non-productive chains are ignored. CellRanger, TraCeR, and the AIRR rearrangment format flag these cells appropriately. When reading custom formats, you need to pass the flag explicitly or filter the chains beforehand.

  • Excess chains, non-productive chains, or chains with invalid loci are serialized to JSON and stored in the extra_chains column. They are not used by scirpy except when exporting the AnnData object to AIRR format.

For more information, see Immune receptor (IR) model.

Note

IR quality control

  • After importing the data, we recommend running the scirpy.tl.chain_qc() function. It will

    1. identify the Receptor type and Receptor subtype and flag cells as ambiguous that cannot unambigously be assigned to a certain receptor (sub)type, and

    2. flag cells with orphan chains (i.e. cells with only a single detected cell) and multichain-cells (i.e. cells with more than two full pairs of VJ- and VDJ-chains).

  • We recommend excluding multichain- and ambiguous cells as these likely represent doublets

  • Based on the orphan chain flags, the corresponding cells can be excluded. Alternatively, these cells can be matched to clonotypes on a single chain only, by using the receptor_arms=”any” parameter when running scirpy.tl.define_clonotypes().

Loading data from estabilshed analysis pipelines or AIRR-compliant tools

We provide convenience functions to load data from 10x CellRanger, BD Rhapsody, TraCeR, or BraCeR with a single function call. Moreover, we support importing data in the community-standard AIRR rearrangement schema.

read_10x_vdj(path[, filtered, include_fields])

Read IR data from 10x Genomics cell-ranger output.

read_tracer(path)

Read data from TraCeR ([SLonnbergP+16]).

read_bracer(path)

Read data from BraCeR ([LEM+18]).

read_airr(path[, use_umi_count_col, …])

Read data from AIRR rearrangement format.

read_bd_rhapsody(path[, dominant])

Read IR data from the BD Rhapsody Analysis Pipeline.

from_dandelion(dandelion[, transfer])

Import data from Dandelion ([SRB+21]).

Read 10x data

With read_10x_vdj() we can load filtered_contig_annotations.csv or contig_annotations.json files as they are produced by CellRanger. Here, we demonstrate how to load paired single cell transcriptomics and TCR sequencing data from COVID19 patients from GSE145926 ([LLY+20]).

[3]:
# Load the TCR data
adata_tcr = ir.io.read_10x_vdj(
    "example_data/liao-2019-covid19/GSM4385993_C144_filtered_contig_annotations.csv.gz"
)

# Load the associated transcriptomics data
adata = sc.read_10x_h5(
    "example_data/liao-2019-covid19/GSM4339772_C144_filtered_feature_bc_matrix.h5"
)
reading example_data/liao-2019-covid19/GSM4339772_C144_filtered_feature_bc_matrix.h5
 (0:00:00)
C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\anndata\_core\anndata.py:1830: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")

This particular sample only has a detected TCR for a small fraction of the cells:

[4]:
adata_tcr.shape
[4]:
(136, 0)
[5]:
adata.shape
[5]:
(3716, 33539)

Next, we integrate both the TCR and the transcriptomics data into a single anndata.AnnData object using scirpy.pp.merge_with_ir():

[6]:
ir.pp.merge_with_ir(adata, adata_tcr)

Now, we can use TCR-related variables together with the gene expression data. Here, we visualize the cells with a detected TCR on the UMAP plot. It is reassuring that the TCRs coincide with the T-cell marker gene CD3.

[7]:
sc.pp.log1p(adata)
sc.pp.pca(adata, svd_solver="arpack")
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color=["has_ir", "CD3E"])
computing PCA
    with n_comps=50
    finished (0:00:02)
computing neighbors
    using 'X_pca' with n_pcs = 50
    finished (0:00:09)
computing UMAP
    finished (0:00:09)
C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\scanpy\plotting\_tools\scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
../_images/tutorials_tutorial_io_10_2.png

Read Smart-seq2 data processed with TraCeR

TraCeR ([SLonnbergP+16]) is a method commonly used to extract TCR sequences from data generated with Smart-seq2 or other full-length single-cell sequencing protocols. Nf-core provides a full pipeline for processing Smart-seq2 sequencing data.

The scirpy.io.read_tracer() function obtains its TCR information from the .pkl file in the filtered_TCR_seqs folder TraCeR generates for each cell.

For this example, we load the ~500 cells from triple-negative breast cancer patients from GSE75688 ([CEL+17]). The raw data has been processed using the aforementioned Smart-seq2 pipeline from nf-core.

[8]:
# extract data
with tarfile.open("example_data/chung-park-2017.tar.bz2", "r:bz2") as tar:
    tar.extractall("example_data/chung-park-2017")
[9]:
# Load transcriptomics data from count matrix
expr_chung = pd.read_csv("example_data/chung-park-2017/counts.tsv", sep="\t")
# anndata needs genes in columns and samples in rows
expr_chung = expr_chung.set_index("Geneid").T
adata = sc.AnnData(expr_chung)
adata.shape
[9]:
(563, 23438)
[10]:
adata_tcr.obs
[10]:
is_cell high_confidence multi_chain extra_chains IR_VJ_1_c_call IR_VJ_2_c_call IR_VDJ_1_c_call IR_VDJ_2_c_call IR_VJ_1_consensus_count IR_VJ_2_consensus_count ... IR_VDJ_2_locus IR_VJ_1_productive IR_VJ_2_productive IR_VDJ_1_productive IR_VDJ_2_productive IR_VJ_1_v_call IR_VJ_2_v_call IR_VDJ_1_v_call IR_VDJ_2_v_call has_ir
obs_names
AAACGGGCAGTAAGAT-1 True True False [] TRAC NaN TRBC2 NaN 40742.0 NaN ... NaN True None True None TRAV12-2 NaN TRBV15 NaN True
AAAGCAATCCGGGTGT-1 True True False [{"c_call": "TRBC2", "consensus_count": 7866, ... TRAC NaN TRBC2 NaN 32200.0 NaN ... NaN True None True None TRAV12-2 NaN TRBV15 NaN True
AAATGCCAGAAGGTGA-1 True True False [{"c_call": "TRAC", "consensus_count": 13546, ... TRAC NaN TRBC1 NaN 37692.0 NaN ... NaN True None True None TRAV21 NaN TRBV4-1 NaN True
AACCATGTCCCTGACT-1 True True False [{"c_call": "TRBC1", "consensus_count": 27398,... TRAC NaN TRBC2 NaN 20780.0 NaN ... NaN True None True None TRAV4 NaN TRBV24-1 NaN True
AACCGCGGTATTACCG-1 True True False [{"c_call": "TRBC1", "consensus_count": 38186,... TRAC NaN TRBC2 NaN 21536.0 NaN ... NaN True None True None TRAV1-2 NaN TRBV12-4 NaN True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
TTAGGACAGGGTGTGT-1 True True False [] TRAC NaN TRBC2 NaN 22304.0 NaN ... NaN True None True None TRAV8-3 NaN TRBV12-4 NaN True
TTCCCAGAGGCTATCT-1 True True False [] TRAC NaN TRBC2 NaN 11806.0 NaN ... NaN True None True None TRAV12-2 NaN TRBV6-5 NaN True
TTGCCGTGTGCATCTA-1 True True False [] TRAC NaN TRBC1 NaN 21220.0 NaN ... NaN True None True None TRAV17 NaN TRBV20-1 NaN True
TTGCGTCAGAGTAATC-1 True True False [] TRAC NaN TRBC2 NaN 21436.0 NaN ... NaN True None True None TRAV29/DV5 NaN TRBV6-6 NaN True
TTGGAACGTACATGTC-1 True True False [] TRAC TRAC TRBC1 NaN 33084.0 7762.0 ... NaN True True True None TRAV19 TRAV17 TRBV9 NaN True

136 rows × 45 columns

[11]:
# Load TCR data and merge it with transcriptomics data
adata_tcr = ir.io.read_tracer("example_data/chung-park-2017/tracer/")
ir.pp.merge_with_ir(adata, adata_tcr)
[12]:
sc.pp.highly_variable_genes(adata, flavor="cell_ranger", n_top_genes=3000)
sc.pp.log1p(adata)
sc.pp.pca(adata, svd_solver="arpack")
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color=["has_ir", "CD3E"])
If you pass `n_top_genes`, all cutoffs are ignored.
extracting highly variable genes
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=50
    finished (0:00:00)
computing neighbors
    using 'X_pca' with n_pcs = 50
    finished (0:00:00)
computing UMAP
    finished (0:00:02)
C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\scanpy\plotting\_tools\scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
../_images/tutorials_tutorial_io_16_2.png

Read an AIRR-compliant rearrangement table

We generated example data using immuneSIM ([WAY+20]). The data consists of 100 cells and does not include transcriptomics data.

The rearrangement tables are often organized into separate tables per chain. Therefore, scirpy.io.read_airr() supports specifiying multiple tsv files at once. This would have the same effect as concatenating them before the import.

[13]:
adata = ir.io.read_airr(
    [
        "example_data/immunesim_airr/immunesim_tra.tsv",
        "example_data/immunesim_airr/immunesim_trb.tsv",
    ]
)
ir.tl.chain_qc(adata)

The dataset does not come with transcriptomics data. We can, therefore, not show the UMAP plot highlighting cells with TCRs, but we can still use scirpy to analyse it. Below, we visualize the clonotype network connecting cells with similar CDR3 sequences.

Note: The cutoff of 25 was chosen for demonstration purposes on this small sample dataset. Usually a smaller cutoff is more approriate.

[14]:
ir.pp.ir_dist(adata, metric="alignment", sequence="aa", cutoff=25)
Computing sequence x sequence distance matrix for VJ sequences.
100%|██████████| 3/3 [00:21<00:00,  7.20s/it]
Computing sequence x sequence distance matrix for VDJ sequences.
100%|██████████| 3/3 [00:21<00:00,  7.14s/it]
[15]:
ir.tl.define_clonotype_clusters(
    adata,
    metric="alignment",
    sequence="aa",
    receptor_arms="any",
    dual_ir="primary_only",
)
ir.tl.clonotype_network(adata, size_aware=False, metric="alignment", sequence="aa")
ir.pl.clonotype_network(
    adata,
    color="cc_aa_alignment",
    base_size=20,
    panel_size=(6, 6),
    show_legend=False,
    show_size_legend=False,
)
Initializing lookup tables.
Computing clonotype x clonotype distances.
100%|██████████| 100/100 [00:00<00:00, 639.98it/s]
Stored clonal assignments in `adata.obs["cc_aa_alignment"]`.
[15]:
<AxesSubplot: title={'center': 'cc_aa_alignment'}>
../_images/tutorials_tutorial_io_21_4.png

Creating AnnData objects from other formats

Often, immune receptor (IR) data are just provided as a simple table listing the CDR3 sequences for each cell. We provide a generic data structure for cells with IRs, which can then be converted into an AnnData object.

AirrCell(cell_id[, cell_attribute_fields, …])

Data structure for a Cell with immune receptors.

from_airr_cells(airr_cells[, include_fields])

Convert a collection of AirrCell objects to AnnData.

If you believe you are working with a commonly used format, consider sending a feature request for a read_XXX function.

For this example, we again load the triple-negative breast cancer data from [CEL+17]. However, this time, we retrieve the TCR data from a separate summary table containing the TCR information (we generated this table for the sake of the example, but it could as well be a supplementary file from the paper).

Such a table typically contains information about

  • CDR3 sequences (amino acid and/or nucleotide)

  • The Chain locus, e.g. TRA, TRB, or IGH.

  • expression of the receptor chain (e.g. count, UMI, transcripts per million (TPM))

  • the V(D)J genes for each chain

  • information if the chain is productive.

[16]:
tcr_table = pd.read_csv(
    "example_data/chung-park-2017/tcr_table.tsv",
    sep="\t",
    index_col=0,
    na_values=["None"],
    true_values=["True"],
)
tcr_table
[16]:
cell_id cdr3_alpha cdr3_nt_alpha count_alpha v_alpha j_alpha cdr3_beta cdr3_nt_beta count_beta v_beta d_beta j_beta productive_alpha productive_beta
0 SRR2973278 AVSDIHASGGSYIPT GCTGTTTCGGATATTCATGCATCAGGAGGAAGCTACATACCTACA 9.29463 TRAV21 TRAJ6 ASSWWQNTEAF GCCAGCAGCTGGTGGCAGAACACTGAAGCTTTC 37.5984 TRBV5-1 NaN TRBJ1-1 True True
1 SRR2973305 AVVTGANSKLT GCTGTGGTAACTGGAGCCAATAGTAAGCTGACA 89.45740 TRAV22 TRAJ56 NaN NaN NaN NaN NaN NaN True True
2 SRR2973371 ALKRTGNTPLV GCTCTGAAAAGAACAGGAAACACACCTCTTGTC 431.96500 TRAV9-2 TRAJ29 ASRSRDSGEPQH GCCAGCAGGAGCAGGGACAGCGGAGAGCCCCAGCAT 952.0230 TRBV10-2 TRBD1 TRBJ1-5 True True
3 SRR2973377 ATDPETSGSRLT GCTACGGACCCAGAAACCAGTGGCTCTAGGTTGACC 772.43600 TRAV17 TRAJ58 NaN NaN NaN NaN NaN NaN True True
4 SRR2973403 AVRGATDSWGKFQ GCTGTGAGAGGAGCAACTGACAGCTGGGGGAAATTCCAG 95.63640 TRAV3 TRAJ24 SVQTSEYEQY AGCGTCCAGACTAGCGAGTACGAGCAGTAC 205.8330 TRBV29-1 TRBD2 TRBJ2-7 True True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
85 SRR5023618 NaN NaN NaN NaN NaN ASSDSPFSSYNEQF GCCAGCAGTGACTCGCCCTTTAGCTCCTACAATGAGCAGTTC 864.4550 TRBV6-4 NaN TRBJ2-1 True True
86 SRR5023621 AENSGGSNYKLT GCAGAGAATAGTGGAGGTAGCAACTATAAACTGACA 512.63000 TRAV13-2 TRAJ53 ASSPDGGGGYT GCCAGCAGCCCTGATGGGGGAGGGGGCTACACC 805.2010 TRBV7-3 TRBD2 TRBJ1-2 True True
87 SRR5023626 ALRIGSNYKLT GCTCTGAGAATCGGTAGCAACTATAAACTGACA 12.51630 TRAV9-2 TRAJ53 NaN NaN NaN NaN NaN NaN True True
88 SRR5023633 NaN NaN NaN NaN NaN ASGLGQSVGGTQY GCTAGTGGCCTAGGGCAGTCGGTAGGAGGGACCCAGTAC 18.4273 TRBV12-5 TRBD2 TRBJ2-5 True True
89 SRR5023639 NaN NaN NaN NaN NaN ASSKGSLGPAGELF GCCAGCAGCAAAGGATCGCTGGGGCCCGCCGGGGAGCTGTTT 905.9260 TRBV21-1 TRBD1 TRBJ2-2 True True

90 rows × 14 columns

Our task is now to dissect the table into AirrCell objects. Each AirrCell can have an arbitrary number of chains. A chain is simply represented as a Python dictionary following the AIRR Rearrangement Schema.

When converting the AirrCell objects into an AnnData object, scirpy will only retain at most two alpha and two beta chains per cell and flag cells which exceed this number as multichain cells. For more information, check the page about our Immune receptor (IR) model.

[17]:
tcr_cells = []
for idx, row in tcr_table.iterrows():
    cell = ir.io.AirrCell(cell_id=row["cell_id"])
    alpha_chain = ir.io.AirrCell.empty_chain_dict()
    beta_chain = ir.io.AirrCell.empty_chain_dict()
    alpha_chain.update(
        {
            "locus": "TRA",
            "junction_aa": row["cdr3_alpha"],
            "junction": row["cdr3_nt_alpha"],
            "consensus_count": row["count_alpha"],
            "v_call": row["v_alpha"],
            "j_call": row["j_alpha"],
            "productive": row["productive_alpha"],
        }
    )
    beta_chain.update(
        {
            "locus": "TRB",
            "junction_aa": row["cdr3_beta"],
            "junction": row["cdr3_nt_beta"],
            "consensus_count": row["count_beta"],
            "v_call": row["v_beta"],
            "d_call": row["d_beta"],
            "j_call": row["j_beta"],
            "productive": row["productive_beta"],
        }
    )
    cell.add_chain(alpha_chain)
    cell.add_chain(beta_chain)
    tcr_cells.append(cell)

Now, we can convert the list of AirrCell objects using scirpy.io.from_airr_cells().

[18]:
adata_tcr = ir.io.from_airr_cells(tcr_cells)
[19]:
adata_tcr.obs
[19]:
multi_chain extra_chains IR_VJ_1_consensus_count IR_VJ_2_consensus_count IR_VDJ_1_consensus_count IR_VDJ_2_consensus_count IR_VJ_1_d_call IR_VJ_2_d_call IR_VDJ_1_d_call IR_VDJ_2_d_call ... IR_VDJ_2_sequence_id IR_VJ_1_v_call IR_VJ_2_v_call IR_VDJ_1_v_call IR_VDJ_2_v_call IR_VJ_1_v_cigar IR_VJ_2_v_cigar IR_VDJ_1_v_cigar IR_VDJ_2_v_cigar has_ir
cell_id
SRR2973278 False [] 9.29463 NaN 37.5984 NaN NaN NaN NaN NaN ... NaN TRAV21 NaN TRBV5-1 NaN NaN NaN NaN NaN True
SRR2973305 False [{"consensus_count": null, "d_call": null, "d_... 89.45740 NaN NaN NaN NaN NaN NaN NaN ... NaN TRAV22 NaN NaN NaN NaN NaN NaN NaN True
SRR2973371 False [] 431.96500 NaN 952.0230 NaN NaN NaN TRBD1 NaN ... NaN TRAV9-2 NaN TRBV10-2 NaN NaN NaN NaN NaN True
SRR2973377 False [{"consensus_count": null, "d_call": null, "d_... 772.43600 NaN NaN NaN NaN NaN NaN NaN ... NaN TRAV17 NaN NaN NaN NaN NaN NaN NaN True
SRR2973403 False [] 95.63640 NaN 205.8330 NaN NaN NaN TRBD2 NaN ... NaN TRAV3 NaN TRBV29-1 NaN NaN NaN NaN NaN True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
SRR5023618 False [{"consensus_count": null, "d_call": null, "d_... NaN NaN 864.4550 NaN NaN NaN NaN NaN ... NaN NaN NaN TRBV6-4 NaN NaN NaN NaN NaN True
SRR5023621 False [] 512.63000 NaN 805.2010 NaN NaN NaN TRBD2 NaN ... NaN TRAV13-2 NaN TRBV7-3 NaN NaN NaN NaN NaN True
SRR5023626 False [{"consensus_count": null, "d_call": null, "d_... 12.51630 NaN NaN NaN NaN NaN NaN NaN ... NaN TRAV9-2 NaN NaN NaN NaN NaN NaN NaN True
SRR5023633 False [{"consensus_count": null, "d_call": null, "d_... NaN NaN 18.4273 NaN NaN NaN TRBD2 NaN ... NaN NaN NaN TRBV12-5 NaN NaN NaN NaN NaN True
SRR5023639 False [{"consensus_count": null, "d_call": null, "d_... NaN NaN 905.9260 NaN NaN NaN TRBD1 NaN ... NaN NaN NaN TRBV21-1 NaN NaN NaN NaN NaN True

90 rows × 67 columns

[20]:
# We can re-use the transcriptomics data from above...
adata = sc.AnnData(expr_chung)
# ... and merge it with the TCR data
ir.pp.merge_with_ir(adata, adata_tcr)
[21]:
sc.pp.highly_variable_genes(adata, flavor="cell_ranger", n_top_genes=3000)
sc.pp.log1p(adata)
sc.pp.pca(adata, svd_solver="arpack")
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color=["has_ir", "CD3E"])
If you pass `n_top_genes`, all cutoffs are ignored.
extracting highly variable genes
    finished (0:00:00)
computing PCA
    on highly variable genes
    with n_comps=50
    finished (0:00:00)
computing neighbors
    using 'X_pca' with n_pcs = 50
    finished (0:00:00)
computing UMAP
    finished (0:00:02)
C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\scanpy\plotting\_tools\scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
../_images/tutorials_tutorial_io_30_2.png

Combining multiple samples

It is quite common that the sequncing data is split up in multiple samples. To combine them into a single object, we load each sample independently using one of the approaches described in this document. Then, we combine them using anndata.concat().

Here is a full example loading and combining three samples from the COVID19 study by [LLY+20].

[22]:
# define sample metadata. Usually read from a file.
samples = {
    "C144": {"group": "mild"},
    "C146": {"group": "severe"},
    "C149": {"group": "healthy control"},
}
[23]:
# Create a list of AnnData objects (one for each sample)
adatas = []
for sample, sample_meta in samples.items():
    gex_file = glob(f"example_data/liao-2019-covid19/*{sample}*.h5")[0]
    tcr_file = glob(f"example_data/liao-2019-covid19/*{sample}*.csv.gz")[0]
    adata = sc.read_10x_h5(gex_file)
    adata_tcr = ir.io.read_10x_vdj(tcr_file)
    ir.pp.merge_with_ir(adata, adata_tcr)
    adata.obs["sample"] = sample
    adata.obs["group"] = sample_meta["group"]
    # concatenation only works with unique gene names
    adata.var_names_make_unique()
    adatas.append(adata)
reading example_data/liao-2019-covid19\GSM4339772_C144_filtered_feature_bc_matrix.h5
 (0:00:00)
C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\anndata\_core\anndata.py:1830: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
reading example_data/liao-2019-covid19\GSM4339774_C146_filtered_feature_bc_matrix.h5
 (0:00:00)
C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\anndata\_core\anndata.py:1830: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
reading example_data/liao-2019-covid19\GSM4475052_C149_filtered_feature_bc_matrix.h5
 (0:00:00)
WARNING: Non-standard locus name ignored: Multi
C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\anndata\_core\anndata.py:1830: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
[24]:
# Merge anndata objects
adata = anndata.concat(adatas)
C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\anndata\_core\anndata.py:1828: UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
  utils.warn_names_duplicates("obs")

The data is now integrated in a single object. Again, the detected TCRs coincide with CD3E gene expression. We clearly observe batch effects between the samples – for a meaningful downstream analysis further processing steps such as highly-variable gene filtering and batch correction are necessary.

[25]:
sc.pp.log1p(adata)
sc.pp.pca(adata, svd_solver="arpack")
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color=["has_ir", "CD3E", "sample"])
computing PCA
    with n_comps=50
    finished (0:00:07)
computing neighbors
    using 'X_pca' with n_pcs = 50
    finished (0:00:25)
computing UMAP
    finished (0:00:11)
C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\scanpy\plotting\_tools\scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\scanpy\plotting\_tools\scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
../_images/tutorials_tutorial_io_36_2.png