scirpy.io.read_airr

scirpy.io.read_airr(path, use_umi_count_col='auto', infer_locus=True, cell_attributes=('is_cell', 'high_confidence', 'multi_chain'), include_fields=('productive', 'locus', 'v_call', 'd_call', 'j_call', 'c_call', 'junction', 'junction_aa', 'consensus_count', 'duplicate_count'))

Read data from AIRR rearrangement format.

The following columns are required by scirpy:
  • cell_id

  • productive

  • locus

  • at least one of consensus_count, duplicate_count, or umi_count

  • at least one of junction_aa or junction.

Data should still import if one of these fields is missing, but they are required by most of scirpy’s processing functions. All chains for which the field junction_aa is missing or empty will be considered non-productive and moved to the extra_chains column.
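Before importing, it can help to compare a file's header against the list above. The following is a minimal sketch using only the standard library; the helper `missing_required_fields` and the constant `REQUIRED_GROUPS` are hypothetical names for illustration and are not part of scirpy.

```python
import csv
import io

# Column requirements as listed above.
# Each inner tuple is an "at least one of" group.
REQUIRED_GROUPS = [
    ("cell_id",),
    ("productive",),
    ("locus",),
    ("consensus_count", "duplicate_count", "umi_count"),
    ("junction_aa", "junction"),
]


def missing_required_fields(header):
    """Return the requirement groups that no column in `header` satisfies."""
    present = set(header)
    return [group for group in REQUIRED_GROUPS if not present.intersection(group)]


# Example: inspect only the header line of a tab-separated AIRR file.
example_tsv = "cell_id\tproductive\tv_call\tjunction_aa\tumi_count\n"
header = next(csv.reader(io.StringIO(example_tsv), delimiter="\t"))
print(missing_required_fields(header))  # [('locus',)]
```

A missing locus column is not necessarily a problem, since infer_locus=True tries to derive it from the gene names.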

Note

Reading data into Scirpy has the following constraints:
  • Each cell can have up to four productive chains (Dual IR): two VJ and two VDJ chains.

  • Excess chains (those with the lowest read count/UMI count) are ignored, and the affected cells are flagged as Multichain-cell.

  • Non-productive chains are ignored.

  • Chain loci must be valid IMGT locus names.

  • Excess chains, non-productive chains, chains without a CDR3 sequence, or chains with invalid loci are serialized to JSON and stored in the extra_chains column. They are not used by scirpy except when exporting the AnnData object to AIRR format.

For more information, see Immune receptor (IR) model.
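The constraints above can be illustrated with a small pure-Python sketch. It is not scirpy's implementation: the function `split_chains`, the grouping of loci into `VJ_LOCI` and `VDJ_LOCI`, and the use of duplicate_count as the ranking key are assumptions made for this example.

```python
# Assumption: loci are grouped into VJ and VDJ chain types like this.
VJ_LOCI = {"TRA", "TRG", "IGK", "IGL"}
VDJ_LOCI = {"TRB", "TRD", "IGH"}


def split_chains(chains, max_per_type=2):
    """Split the chains of one cell into kept and extra chains.

    `chains` is a list of dicts with the keys `locus`, `productive`,
    `junction_aa` and `duplicate_count`.
    Returns a tuple (kept, extra, multi_chain).
    """
    candidates = {"VJ": [], "VDJ": []}
    extra = []
    for chain in chains:
        locus = chain.get("locus")
        usable = chain.get("productive") and chain.get("junction_aa")
        if usable and locus in VJ_LOCI:
            candidates["VJ"].append(chain)
        elif usable and locus in VDJ_LOCI:
            candidates["VDJ"].append(chain)
        else:
            # Non-productive, no CDR3 sequence, or invalid locus.
            extra.append(chain)

    kept = []
    multi_chain = False
    for group in candidates.values():
        # Highest count first; chains beyond the limit are excess chains.
        ranked = sorted(group, key=lambda c: c.get("duplicate_count") or 0, reverse=True)
        kept.extend(ranked[:max_per_type])
        extra.extend(ranked[max_per_type:])
        multi_chain = multi_chain or len(ranked) > max_per_type
    return kept, extra, multi_chain


cell = [
    {"locus": "TRA", "productive": True, "junction_aa": "CAVF", "duplicate_count": 5},
    {"locus": "TRA", "productive": True, "junction_aa": "CAAF", "duplicate_count": 9},
    {"locus": "TRA", "productive": True, "junction_aa": "CALF", "duplicate_count": 1},
    {"locus": "TRB", "productive": True, "junction_aa": "CASSF", "duplicate_count": 7},
    {"locus": "TRB", "productive": False, "junction_aa": "CASRF", "duplicate_count": 3},
]
kept, extra, multi_chain = split_chains(cell)
```

In this example the cell has three productive TRA chains, so the one with the lowest count ends up in `extra` together with the non-productive TRB chain, and the cell is flagged as multichain.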

Parameters
path : str | Sequence[str] | Path | Sequence[Path] | DataFrame | Sequence[DataFrame]

Path to the AIRR rearrangement TSV file. If different chains are split up into multiple files, these can be specified as a list, e.g. ["path/to/tcr_alpha.tsv", "path/to/tcr_beta.tsv"]. Alternatively, this can be a pandas data frame.

use_umi_count_col : bool | {'auto'} (default: 'auto')

Whether to add UMI counts from the non-standard (but common) umi_count column. When this column is used, the UMI counts are moved over to the standard duplicate_count column. Default: Use umi_count if there is no duplicate_count column present.
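The 'auto' behaviour described here can be sketched in plain Python. This is an illustration of the documented rule, not the library's code; the function `resolve_umi_counts` and the representation of records as dicts are assumptions.

```python
def resolve_umi_counts(rows, use_umi_count_col="auto"):
    """Return copies of `rows` with `umi_count` moved to `duplicate_count` if requested."""
    if use_umi_count_col == "auto":
        # Default: use umi_count only if no record has a duplicate_count column.
        use_umi_count_col = not any("duplicate_count" in row for row in rows)
    result = []
    for row in rows:
        row = dict(row)  # copy, so the caller's data is left unchanged
        if use_umi_count_col and "umi_count" in row:
            row["duplicate_count"] = row.pop("umi_count")
        result.append(row)
    return result


rows = [{"cell_id": "c1", "umi_count": 4}]
converted = resolve_umi_counts(rows)
```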

infer_locus : bool (default: True)

Try to infer the locus column from gene names, in case it is not specified.

cell_attributes : Collection[str] (default: ('is_cell', 'high_confidence', 'multi_chain'))

Fields in the rearrangement schema that are specific to a cell rather than a chain. The values must be identical across all records belonging to a cell. This defaults to ("is_cell","high_confidence","multi_chain").
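To verify in advance that cell-level fields are identical across the records of each cell, a check along the following lines can be used. It is a sketch only: `inconsistent_cell_attributes` is a hypothetical helper, and records are assumed to be dicts that contain a cell_id key.

```python
from collections import defaultdict

DEFAULT_CELL_ATTRIBUTES = ("is_cell", "high_confidence", "multi_chain")


def inconsistent_cell_attributes(rows, cell_attributes=DEFAULT_CELL_ATTRIBUTES):
    """Map each cell_id to the cell-level attributes whose values differ between its records."""
    # cell_id -> attribute -> set of values observed for that attribute
    seen = defaultdict(lambda: defaultdict(set))
    for row in rows:
        for attr in cell_attributes:
            if attr in row:
                seen[row["cell_id"]][attr].add(row[attr])

    conflicts = {}
    for cell_id, attrs in seen.items():
        differing = sorted(attr for attr, values in attrs.items() if len(values) > 1)
        if differing:
            conflicts[cell_id] = differing
    return conflicts


records = [
    {"cell_id": "c1", "is_cell": True, "locus": "TRA"},
    {"cell_id": "c1", "is_cell": False, "locus": "TRB"},
    {"cell_id": "c2", "is_cell": True, "locus": "TRA"},
]
conflicts = inconsistent_cell_attributes(records)
```

Here `conflicts` reports `is_cell` for cell `c1`, because its two records disagree on that field.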

include_fields : Collection[str] | None (default: ('productive', 'locus', 'v_call', 'd_call', 'j_call', 'c_call', 'junction', 'junction_aa', 'consensus_count', 'duplicate_count'))

The fields to include in adata. The AIRR rearrangement schema can contain a lot of columns, most of which are irrelevant for most analyses. By default, this includes a subset of columns relevant for a typical scirpy analysis, to keep adata.obs a bit cleaner. Defaults to ("productive","locus","v_call","d_call","j_call","c_call","junction","junction_aa","consensus_count","duplicate_count"). Set this to None to include all columns.

Return type

AnnData

Returns

AnnData object with IR data in obs for each cell. For more details see Data structure.
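A typical call might look like the sketch below. The wrapper `load_airr_files` and its `keep_all_fields` flag are hypothetical names for illustration; only `scirpy.io.read_airr` and its `include_fields` parameter come from the signature above. The import is placed inside the function so that the snippet can be defined without scirpy installed.

```python
def load_airr_files(paths, keep_all_fields=False):
    """Read one or more AIRR rearrangement files into a single AnnData object."""
    # Imported lazily; calling this function requires scirpy to be installed.
    import scirpy as ir

    kwargs = {}
    if keep_all_fields:
        # None keeps every column of the rearrangement schema.
        kwargs["include_fields"] = None
    return ir.io.read_airr(paths, **kwargs)


# Usage, with the example file names from the `path` parameter description:
# adata = load_airr_files(["path/to/tcr_alpha.tsv", "path/to/tcr_beta.tsv"])
# adata.obs.head()
```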