scirpy.io.read_airr
- scirpy.io.read_airr(path, use_umi_count_col='auto', infer_locus=True, cell_attributes=('is_cell', 'high_confidence', 'multi_chain'), include_fields=('productive', 'locus', 'v_call', 'd_call', 'j_call', 'c_call', 'junction', 'junction_aa', 'consensus_count', 'duplicate_count'))
Read data from AIRR rearrangement format.
- The following columns are required by scirpy:
cell_id
productive
locus
at least one of consensus_count, duplicate_count, or umi_count
at least one of junction_aa or junction
Data should still import if one of these fields is missing, but they are required by most of scirpy’s processing functions.
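The column rules above can be checked before importing a file. The following is a minimal sketch using only the Python standard library; the helper name `missing_airr_columns` and the in-memory example data are hypothetical and not part of scirpy.

```python
import csv
import io

# Column rules as listed above. The helper below is illustrative only,
# it is not a scirpy function.
REQUIRED = ("cell_id", "productive", "locus")
AT_LEAST_ONE_OF = (
    ("consensus_count", "duplicate_count", "umi_count"),
    ("junction_aa", "junction"),
)


def missing_airr_columns(header):
    """Return a list of human-readable problems for a list of column names."""
    present = set(header)
    problems = [f"missing column: {name}" for name in REQUIRED if name not in present]
    for group in AT_LEAST_ONE_OF:
        if not present.intersection(group):
            problems.append("need at least one of: " + ", ".join(group))
    return problems


# Tiny in-memory TSV standing in for a real file (made-up values).
tsv = "cell_id\tproductive\tlocus\tjunction_aa\nAAAC-1\tT\tTRB\tCASSLGTDTQYF\n"
header = next(csv.reader(io.StringIO(tsv), delimiter="\t"))
print(missing_airr_columns(header))
```

An empty list means that all columns named above are present. Since data may still import with a field missing, a non-empty list is a warning rather than a hard failure.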
Note
- Reading data into Scirpy has the following constraints:
Each cell can have up to four productive chains (Dual IR): two VJ and two VDJ chains.
Excess chains (those with the lowest read count/UMI count) are ignored, and the cell is flagged as a multichain cell.
Non-productive chains are ignored.
Chain loci must be valid IMGT locus names.
Excess chains, non-productive chains, or chains with invalid loci are serialized to JSON and stored in the extra_chains column. They are not used by scirpy except when exporting the AnnData object to AIRR format.
For more information, see Immune receptor (IR) model.
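To make the constraints above concrete, here is a rough sketch of how the chains of a single cell could be partitioned. It is not scirpy's implementation: the function name `split_chains`, the grouping of loci into VJ and VDJ sets, the `id` field, and the use of `duplicate_count` as the ranking key are assumptions made for illustration.

```python
import json

# Assumption: IMGT loci grouped by chain type.
VJ_LOCI = {"TRA", "TRG", "IGK", "IGL"}
VDJ_LOCI = {"TRB", "TRD", "IGH"}


def split_chains(chains, max_per_type=2):
    """Split one cell's chain records into (kept, extra_json, multichain)."""
    kept, extra = [], []
    by_type = {"VJ": [], "VDJ": []}
    for chain in chains:
        locus = chain.get("locus")
        if not chain.get("productive") or locus not in VJ_LOCI | VDJ_LOCI:
            # Non-productive chains and chains with invalid loci are set aside.
            extra.append(chain)
        else:
            by_type["VJ" if locus in VJ_LOCI else "VDJ"].append(chain)

    multichain = False
    for group in by_type.values():
        # Keep the chains with the highest count, set aside the rest.
        group.sort(key=lambda c: c.get("duplicate_count", 0), reverse=True)
        kept.extend(group[:max_per_type])
        if len(group) > max_per_type:
            multichain = True
            extra.extend(group[max_per_type:])
    # Set-aside chains are serialized to JSON, mirroring the extra_chains column.
    return kept, json.dumps(extra), multichain


# Made-up records for one cell; "id" is a hypothetical field for readability.
chains = [
    {"id": "a1", "locus": "TRA", "productive": True, "duplicate_count": 9},
    {"id": "a2", "locus": "TRA", "productive": True, "duplicate_count": 4},
    {"id": "a3", "locus": "TRA", "productive": True, "duplicate_count": 1},
    {"id": "b1", "locus": "TRB", "productive": True, "duplicate_count": 7},
    {"id": "b2", "locus": "TRB", "productive": False, "duplicate_count": 3},
    {"id": "x1", "locus": "XYZ", "productive": True, "duplicate_count": 2},
]
kept, extra_json, multichain = split_chains(chains)
print([c["id"] for c in kept], multichain)
```

In this example the third TRA chain exceeds the limit of two VJ chains, so the cell is flagged as multichain and that chain joins the non-productive and invalid-locus chains in the JSON string.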
- Parameters
- path :
str | Sequence[str] | Path | Sequence[Path]
Path to the AIRR rearrangement TSV file. If different chains are split up into multiple files, these can be specified as a list, e.g. ["path/to/tcr_alpha.tsv", "path/to/tcr_beta.tsv"].
- use_umi_count_col :
bool | {'auto'} (default: 'auto')
Whether to add UMI counts from the non-standard (but common) umi_count column. When this column is used, the UMI counts are moved over to the standard duplicate_count column. Default: use umi_count if there is no duplicate_count column present.
- infer_locus :
bool (default: True)
Try to infer the locus column from gene names, in case it is not specified.
- cell_attributes :
Collection[str] (default: ('is_cell', 'high_confidence', 'multi_chain'))
Fields in the rearrangement schema that are specific to a cell rather than a chain. The values must be identical over all records belonging to a cell.
- include_fields :
Collection[str] | None (default: ('productive', 'locus', 'v_call', 'd_call', 'j_call', 'c_call', 'junction', 'junction_aa', 'consensus_count', 'duplicate_count'))
The fields to include in adata. The AIRR rearrangement schema can contain a lot of columns, most of which are irrelevant for most analyses. By default, this includes a subset of columns relevant for a typical scirpy analysis, to keep adata.obs a bit cleaner. Set this to None to include all columns.
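The behavior described for use_umi_count_col can be pictured with a small sketch. This is not scirpy's code: `resolve_umi_counts` is a hypothetical helper that works on plain dictionaries, and the assumption that an explicit True overwrites an existing duplicate_count value is mine, not stated on this page.

```python
def resolve_umi_counts(rows, use_umi_count_col="auto"):
    """Return copies of `rows` with umi_count moved to duplicate_count if requested."""
    has_duplicate = any("duplicate_count" in row for row in rows)
    has_umi = any("umi_count" in row for row in rows)
    if use_umi_count_col == "auto":
        # 'auto': only fall back to umi_count when duplicate_count is absent.
        use_umi = has_umi and not has_duplicate
    else:
        use_umi = bool(use_umi_count_col) and has_umi

    result = []
    for row in rows:
        row = dict(row)  # copy, so the input records stay untouched
        if use_umi and "umi_count" in row:
            # Assumption: the moved value replaces any existing duplicate_count.
            row["duplicate_count"] = row.pop("umi_count")
        result.append(row)
    return result


# Made-up records with only the non-standard column present.
rows = [{"cell_id": "c1", "umi_count": 12}]
print(resolve_umi_counts(rows))
```

With scirpy installed, a call matching the signature documented here would be `scirpy.io.read_airr(["path/to/tcr_alpha.tsv", "path/to/tcr_beta.tsv"], include_fields=None)`, which reads both files and keeps all columns.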
- Return type
AnnData
- Returns
AnnData object with IR data in obs for each cell. For more details see Data structure.