scirpy.io.read_airr

scirpy.io.read_airr(path, use_umi_count_col='auto', infer_locus=True, cell_attributes=('is_cell', 'high_confidence', 'multi_chain'), include_fields=('productive', 'locus', 'v_call', 'd_call', 'j_call', 'c_call', 'junction', 'junction_aa', 'consensus_count', 'duplicate_count'))

Read data from AIRR rearrangement format.

The following columns are required by scirpy:

cell_id
productive
locus
at least one of consensus_count, duplicate_count, or umi_count
at least one of junction_aa or junction.

Data should still import if one of these fields is missing, but they are required by most of scirpy’s processing functions.
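
As a rough pre-flight check, the column requirements above can be tested against a file header before importing. The sketch below uses only the Python standard library; `missing_required_columns` and `read_header` are hypothetical helper names for illustration and are not part of scirpy.

```python
import csv
import io

# Columns that must all be present, per the list above.
REQUIRED = ("cell_id", "productive", "locus")
# Groups of which at least one column must be present.
ONE_OF = (
    ("consensus_count", "duplicate_count", "umi_count"),
    ("junction_aa", "junction"),
)


def missing_required_columns(header):
    """Return descriptions of the column requirements that `header` does not meet."""
    cols = set(header)
    problems = [c for c in REQUIRED if c not in cols]
    for group in ONE_OF:
        if not cols.intersection(group):
            problems.append("one of: " + ", ".join(group))
    return problems


def read_header(handle):
    """Read the first row of a tab-separated AIRR file as column names."""
    return next(csv.reader(handle, delimiter="\t"))


# Example with an in-memory file (hypothetical content, no locus column).
example = io.StringIO("cell_id\tproductive\tv_call\tjunction_aa\tumi_count\n")
print(missing_required_columns(read_header(example)))  # prints ['locus']
```

An empty list means all listed requirements are met. A non-empty list does not necessarily block the import, but most downstream functions will likely need the missing fields.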

Note

Reading data into Scirpy has the following constraints:

Each cell can have up to four productive chains (Dual IR): two VJ and two VDJ chains.
Excess chains (those with the lowest read count/UMI count) are ignored, and the cell is flagged as Multichain-cell.
Non-productive chains are ignored.
Chain loci must be valid IMGT locus names.
Excess chains, non-productive chains, or chains with invalid loci are serialized to JSON and stored in the extra_chains column. They are not used by scirpy except when exporting the AnnData object to AIRR format.

For more information, see Immune receptor (IR) model.
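
The constraints in this note can be illustrated with a small pure-Python sketch. This is not scirpy's implementation: `split_chains` is a hypothetical function, the locus-to-chain-type sets are an assumption made for the example, and ranking by `duplicate_count` stands in for the "read count/UMI count" criterion.

```python
# Assumed locus-to-chain-type mapping, for illustration only.
VJ_LOCI = {"TRA", "TRG", "IGK", "IGL"}
VDJ_LOCI = {"TRB", "TRD", "IGH"}


def split_chains(chains, count_key="duplicate_count"):
    """Split one cell's chains into kept VJ, kept VDJ, and extra chains.

    Returns a tuple (vj, vdj, extra, multi_chain).
    """
    vj, vdj, extra = [], [], []
    for chain in chains:
        if not chain.get("productive"):
            extra.append(chain)  # non-productive chains are not used
        elif chain.get("locus") in VJ_LOCI:
            vj.append(chain)
        elif chain.get("locus") in VDJ_LOCI:
            vdj.append(chain)
        else:
            extra.append(chain)  # locus name not recognized
    # A cell with more than two productive chains of one type is multichain.
    multi_chain = len(vj) > 2 or len(vdj) > 2
    kept = []
    for group in (vj, vdj):
        # Keep the two chains with the highest count; the rest become extra.
        group.sort(key=lambda c: c.get(count_key, 0), reverse=True)
        extra.extend(group[2:])
        kept.append(group[:2])
    return kept[0], kept[1], extra, multi_chain


cell = [
    {"locus": "TRA", "productive": True, "duplicate_count": 10},
    {"locus": "TRA", "productive": True, "duplicate_count": 3},
    {"locus": "TRA", "productive": True, "duplicate_count": 7},
    {"locus": "TRB", "productive": True, "duplicate_count": 5},
    {"locus": "TRB", "productive": False, "duplicate_count": 9},
]
vj, vdj, extra, multi_chain = split_chains(cell)
print([c["duplicate_count"] for c in vj], len(vdj), len(extra), multi_chain)
# prints [10, 7] 1 2 True
```

Here the third TRA chain and the non-productive TRB chain end up in `extra`, mirroring what the note describes for the extra_chains column.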

Parameters

path : str | Sequence[str] | Path | Sequence[Path]: Path to the AIRR rearrangement tsv file. If different chains are split up into multiple files, these can be specified as a list, e.g. ["path/to/tcr_alpha.tsv", "path/to/tcr_beta.tsv"].
use_umi_count_col : bool | {'auto'} (default: 'auto'): Whether to add UMI counts from the non-standard (but common) umi_count column. When this column is used, the UMI counts are moved over to the standard duplicate_count column. With 'auto', umi_count is used only if no duplicate_count column is present.
infer_locus : bool (default: True): Try to infer the locus column from gene names, in case it is not specified.
cell_attributes : Collection[str] (default: ('is_cell', 'high_confidence', 'multi_chain')): Fields in the rearrangement schema that are specific to a cell rather than a chain. The values must be identical across all records belonging to a cell.
include_fields : Collection[str] | None (default: ('productive', 'locus', 'v_call', 'd_call', 'j_call', 'c_call', 'junction', 'junction_aa', 'consensus_count', 'duplicate_count')): The fields to include in adata. The AIRR rearrangement schema can contain a lot of columns, most of which are irrelevant for most analyses. By default, only a subset of columns relevant for a typical scirpy analysis is included, to keep adata.obs cleaner. Set this to None to include all columns.
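
A minimal usage sketch, assuming scirpy is installed and that the example file names from the `path` description exist. `read_airr_kwargs` and `load_airr` are hypothetical wrappers written for this example. Only keyword names from the signature above are passed through.

```python
from pathlib import Path


def read_airr_kwargs(all_fields=False):
    """Build keyword arguments for scirpy.io.read_airr (names from the signature above)."""
    kwargs = {
        "use_umi_count_col": "auto",  # use umi_count only if duplicate_count is absent
        "infer_locus": True,  # derive locus from gene names when it is missing
    }
    if all_fields:
        kwargs["include_fields"] = None  # keep every AIRR column
    return kwargs


def load_airr(paths, all_fields=False):
    """Read one or more AIRR rearrangement TSV files into a single AnnData object."""
    import scirpy as ir  # imported lazily; requires scirpy to be installed

    return ir.io.read_airr([Path(p) for p in paths], **read_airr_kwargs(all_fields))


# Usage (file names taken from the `path` description above):
# adata = load_airr(["path/to/tcr_alpha.tsv", "path/to/tcr_beta.tsv"])
# adata.obs.head()
```

Leaving `include_fields` at its default keeps adata.obs compact. Passing `all_fields=True` in this sketch requests every column instead.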

Return type

AnnData

Returns

AnnData object with IR data in obs for each cell. For more details see Data structure.