scirpy.io.read_airr
- scirpy.io.read_airr(path, use_umi_count_col='auto', infer_locus=True, cell_attributes=('is_cell', 'high_confidence', 'multi_chain'), include_fields=('productive', 'locus', 'v_call', 'd_call', 'j_call', 'c_call', 'junction', 'junction_aa', 'consensus_count', 'duplicate_count'))
Read data from AIRR rearrangement format.
- The following columns are required by scirpy:
  - cell_id
  - productive
  - locus
  - at least one of consensus_count, duplicate_count, or umi_count
  - at least one of junction_aa or junction.
Data should still import if one of these fields is missing, but they are required by most of scirpy’s processing functions. All chains for which the field junction_aa is missing or empty will be considered non-productive and moved to the extra_chains column.
Note
- Reading data into Scirpy has the following constraints:
Each cell can have up to four productive chains (Dual IR): two VJ and two VDJ chains.
Excess chains (those with the lowest read count/UMI count) are ignored, and the affected cells are flagged as Multichain-cell.
Non-productive chains are ignored.
Chain loci must be valid IMGT locus names.
Excess chains, non-productive chains, chains without a CDR3 sequence, or chains with invalid loci are serialized to JSON and stored in the extra_chains column. They are not used by scirpy except when exporting the AnnData object to AIRR format.
For more information, see Immune receptor (IR) model.
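The constraints above can be illustrated with a small standalone sketch. This is not scirpy's implementation: the function names, the ranking by duplicate_count with a fallback to consensus_count, and the grouping of loci into VJ (TRA, TRG, IGK, IGL) and VDJ (TRB, TRD, IGH) sets are assumptions made for illustration only.

```python
import json

# Assumption: grouping of IMGT loci into VJ and VDJ chain types.
VJ_LOCI = {"TRA", "TRG", "IGK", "IGL"}
VDJ_LOCI = {"TRB", "TRD", "IGH"}


def count_of(chain):
    """Ranking value for a chain: duplicate_count, else consensus_count, else 0."""
    for key in ("duplicate_count", "consensus_count"):
        value = chain.get(key)
        if value is not None:
            return value
    return 0


def split_chains(chains, max_per_type=2):
    """Split the chains of one cell into kept chains and extra chains.

    Returns (kept, extra, multichain), where kept maps "VJ"/"VDJ" to at most
    max_per_type chains each and multichain marks cells with excess chains.
    """
    candidates = {"VJ": [], "VDJ": []}
    extra = []
    for chain in chains:
        locus = chain.get("locus")
        # A chain is usable only if it is productive and has a CDR3 sequence.
        usable = bool(chain.get("productive")) and bool(chain.get("junction_aa"))
        if usable and locus in VJ_LOCI:
            candidates["VJ"].append(chain)
        elif usable and locus in VDJ_LOCI:
            candidates["VDJ"].append(chain)
        else:
            extra.append(chain)

    kept = {}
    multichain = False
    for chain_type, group in candidates.items():
        ranked = sorted(group, key=count_of, reverse=True)
        kept[chain_type] = ranked[:max_per_type]
        if len(ranked) > max_per_type:
            # Chains with the lowest counts become extra chains.
            multichain = True
            extra.extend(ranked[max_per_type:])
    return kept, extra, multichain


def serialize_extra(extra):
    """Serialize the extra chains to a JSON string, as described for extra_chains."""
    return json.dumps(extra)
```

Calling split_chains on the records of a single cell yields the chains that would occupy the four available slots, and serialize_extra turns the remaining records into one JSON string per cell.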
- Parameters
- path : str | Sequence[str] | Path | Sequence[Path] | DataFrame | Sequence[DataFrame]
  Path to the AIRR rearrangement tsv file. If different chains are split up into multiple files, these can be specified as a list, e.g. ["path/to/tcr_alpha.tsv", "path/to/tcr_beta.tsv"]. Alternatively, this can be a pandas data frame.
- use_umi_count_col : bool | {'auto'} (default: 'auto')
  Whether to add UMI counts from the non-standard (but common) umi_count column. When this column is used, the UMI counts are moved over to the standard duplicate_count column. Default: use umi_count if there is no duplicate_count column present.
- infer_locus : bool (default: True)
  Try to infer the locus column from gene names, in case it is not specified.
- cell_attributes : Collection[str] (default: ('is_cell', 'high_confidence', 'multi_chain'))
  Fields in the rearrangement schema that are specific for a cell rather than a chain. The values must be identical over all records belonging to a cell.
- include_fields : Collection[str] | None (default: ('productive', 'locus', 'v_call', 'd_call', 'j_call', 'c_call', 'junction', 'junction_aa', 'consensus_count', 'duplicate_count'))
  The fields to include in adata. The AIRR rearrangement schema can contain a lot of columns, most of which are irrelevant for most analyses. By default, this includes a subset of columns relevant for a typical scirpy analysis, to keep adata.obs a bit cleaner. Set this to None to include all columns.
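The behaviour described for use_umi_count_col can be sketched on plain dictionaries. The helper name apply_umi_count_col and the row representation are hypothetical, and what scirpy does with an unused umi_count column is not specified here, so the sketch simply leaves it in place.

```python
def apply_umi_count_col(rows, use_umi_count_col="auto"):
    """Return copies of rows with umi_count moved to duplicate_count if requested.

    rows is a list of dicts, one per rearrangement record (an assumption of
    this sketch; it is not how scirpy represents records internally).
    """
    if use_umi_count_col == "auto":
        # 'auto': use umi_count only if no duplicate_count column is present.
        has_duplicate_count = any("duplicate_count" in row for row in rows)
        use_umi_count_col = not has_duplicate_count

    result = []
    for row in rows:
        row = dict(row)  # copy, so the input is not modified
        if use_umi_count_col and "umi_count" in row:
            # Move the values over to the standard duplicate_count column.
            row["duplicate_count"] = row.pop("umi_count")
        result.append(row)
    return result
```

With the default 'auto', a table that only has umi_count ends up with duplicate_count, while a table that already has duplicate_count is returned unchanged.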
- Return type
  AnnData
- Returns
  AnnData object with IR data in obs for each cell. For more details see Data structure.
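A minimal usage sketch, assuming scirpy is installed in a version that matches the signature documented above. The helpers missing_requirements, read_header and load_airr are hypothetical names introduced here; only scirpy.io.read_airr itself comes from this page. The column check mirrors the list of required columns and is run before the import, since data with missing columns may still load but fail in later processing.

```python
import csv


def missing_requirements(columns):
    """List the column requirements that a set of column names does not meet."""
    columns = set(columns)
    missing = [name for name in ("cell_id", "productive", "locus") if name not in columns]
    if not columns & {"consensus_count", "duplicate_count", "umi_count"}:
        missing.append("one of consensus_count, duplicate_count, umi_count")
    if not columns & {"junction_aa", "junction"}:
        missing.append("one of junction_aa, junction")
    return missing


def read_header(path):
    """Read the column names from the first line of a tab-separated file."""
    with open(path, newline="") as handle:
        return next(csv.reader(handle, delimiter="\t"))


def load_airr(path):
    """Warn about missing columns, then read the file with scirpy."""
    import scirpy as ir  # imported lazily so the helpers above work without scirpy

    problems = missing_requirements(read_header(path))
    if problems:
        print("Columns expected by most scirpy functions are missing:", problems)
    # Defaults as documented above: use_umi_count_col='auto', infer_locus=True.
    return ir.io.read_airr(path)
```

A call such as load_airr("path/to/tcr_alpha.tsv") would then return the AnnData object described above; passing include_fields=None to read_airr keeps all columns instead of the default subset.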