Shadows features¶
[1]:
%load_ext autoreload
%autoreload 2
[2]:
import os
os.chdir("../../")
[3]:
from pathlib import Path
data = Path("data/")
Shadow objects and their features¶
While shadow objects provide a convenient read-only drop-in replacement for AnnData/MuData objects when needed, they also have additional features that can help users make the most of shadows.
Import classes for these shadow objects:
[4]:
from shadows import AnnDataShadow, MuDataShadow
Initialise a multimodal shadow object:
[5]:
file = data / "pbmc5k_citeseq/pbmc5k_citeseq_processed.h5mu"
mdata = MuDataShadow(file)
File¶
The file connection that the shadow is using can be accessed via the .file
attribute:
[6]:
mdata.file
[6]:
<HDF5 file "pbmc5k_citeseq_processed.h5mu" (mode r)>
The name of the file can then be accessed via
[7]:
mdata.file.filename
[7]:
'data/pbmc5k_citeseq/pbmc5k_citeseq_processed.h5mu'
The connection stays open until mdata.close()
is called
[8]:
mdata.close()
… or until the file has to be re-opened for modification (see below).
Permissions¶
We can open HDF5 files in different modes including purely read-only ('r'
) and read/write ('r+'
). The mode can be provided to the constructor:
[9]:
mdata = MuDataShadow(file, mode="r")
mdata.file.mode
[9]:
'r'
Let’s add some data to the in-memory shadow object:
[10]:
mdata["rna"].obsm["X_pca_copy"] = mdata["rna"].obsm["X_pca"].copy()
We can also conveniently close and reopen the connection for a given in-memory shadow object:
[11]:
mdata.reopen(mode="r+")
mdata.file.mode
[11]:
'r+'
This way all the newly added elements are still available in memory:
[12]:
mdata["rna"].obsm
[12]:
obsm: X_pcaᐁ, X_umap, X_pca_copy▲
[13]:
# Clean up
mdata.close()
del mdata
Individual modality access¶
Individual modalities stored in the .h5mu files can be accessed as part of the MuDataShadow
object:
[14]:
mdata = MuDataShadow(file, mode="r")
mdata["rna"]
[14]:
AnnData Shadow object with n_obs × n_vars = 3891 × 17806
X
raw: X, var, varm
obs: _index, celltype, leiden, n_genes_by_counts, pct_counts_mt, total_counts, total_counts_mt
var: _index, dispersions, dispersions_norm, feature_types, gene_ids, highly_variable, mean, mean_counts, means, mt, n_cells_by_counts, pct_dropout_by_counts, std, total_counts
obsm: X_pca, X_umap
varm: PCs
obsp: connectivities, distances
uns: celltype_colors, hvg, leiden, leiden_colors, neighbors, pca, rank_genes_groups, umap
Moreover, one can also create a direct connection to a specific modality:
[15]:
mdata.close()
del mdata
adata = AnnDataShadow(file / "mod/rna")
adata
[15]:
AnnData Shadow object with n_obs × n_vars = 3891 × 17806
X
raw: X, var, varm
obs: _index, celltype, leiden, n_genes_by_counts, pct_counts_mt, total_counts, total_counts_mt
var: _index, dispersions, dispersions_norm, feature_types, gene_ids, highly_variable, mean, mean_counts, means, mt, n_cells_by_counts, pct_dropout_by_counts, std, total_counts
obsm: X_pca, X_umap
varm: PCs
obsp: connectivities, distances
uns: celltype_colors, hvg, leiden, leiden_colors, neighbors, pca, rank_genes_groups, umap
[16]:
# Clean up
adata.close()
del adata
Class identity¶
Many tools in the ecosystem including scanpy frequently check if the input object is an AnnData. For instance, in ``sc.pp.highly_variable_genes` <https://github.com/scverse/scanpy/blob/master/scanpy/preprocessing/_highly_variable_genes.py>`__ it reads:
if not isinstance(adata, AnnData):
raise ValueError(
'`pp.highly_variable_genes` expects an `AnnData` argument, '
'pass `inplace=False` if you want to return a `pd.DataFrame`.'
)
In order for shadow objects to be accepted by such functions, they mock their class identity:
[17]:
mdata = MuDataShadow(file, mode="r")
from mudata import MuData
assert isinstance(mdata, MuData), "mdata is not a valid MuData object"
[18]:
from anndata import AnnData
assert isinstance(mdata["rna"], AnnData), "mdata['rna'] is not a valid AnnData object"
Checking for shadow identity still works:
[19]:
isinstance(mdata, MuDataShadow)
[19]:
True
[20]:
isinstance(mdata["rna"], AnnDataShadow)
[20]:
True
[21]:
mdata.close()
Backends¶
AnnData/MuData are based on a NumPy/Pandas stack. This is the default for the shadow objects in order to provide compatibility with AnnData/MuData objects.
However the nature of shadow files also simplifies loading individual matrices or tables with alternative backends, e.g. JAX (Array
), PyTorch (Tensor
) or polars (DataFrame
).
[22]:
mdata = MuDataShadow(file, array_backend="jax", table_backend="polars")
[23]:
obs = mdata["rna"].obs
print(type(obs))
obs.head()
<class 'polars.internals.dataframe.frame.DataFrame'>
[23]:
_index | celltype | leiden | n_genes_by_counts | pct_counts_mt | total_counts | total_counts_mt |
---|---|---|---|---|---|---|
object | cat | cat | i32 | f32 | f32 | f32 |
AAACCCAAGAGACAAG-1 | "intermediate m... | "3" | 2363 | 6.332204 | 7375.0 | 467.0 |
AAACCCAAGGCCTAGA-1 | "CD4+ naïve T" | "0" | 1259 | 9.093319 | 3772.0 | 343.0 |
AAACCCAGTCGTGCCA-1 | "CD4+ memory T" | "2" | 1578 | 13.178295 | 4902.0 | 646.0 |
AAACCCATCGTGCATA-1 | "CD4+ memory T" | "2" | 1908 | 6.354415 | 6704.0 | 426.0 |
AAACGAAAGACAAGCC-1 | "CD14 mono" | "1" | 1589 | 9.307693 | 3900.0 | 363.0 |
[24]:
rna_pca = mdata["rna"].obsm["X_pca"]
print(type(rna_pca))
rna_pca
<class 'jaxlib.xla_extension.DeviceArray'>
[24]:
DeviceArray([[ 20.551052 , 0.36840764, -1.6193684 , ...,
0.09656975, -0.90912175, -0.77955467],
[ -9.47144 , -5.5212517 , -5.107428 , ...,
0.64674896, -0.892091 , 1.7873902 ],
[ -9.913012 , 2.766899 , -2.0684972 , ...,
-0.6454743 , 1.615869 , -0.63476324],
...,
[ -8.727723 , 7.9196725 , 1.3326805 , ...,
1.4592032 , 0.91210324, 1.3184382 ],
[-10.792531 , 3.2086673 , -2.0437238 , ...,
1.7311838 , -1.840564 , 1.3253008 ],
[ 20.642431 , 0.49294943, -1.6694897 , ...,
-0.51208967, 0.60652566, -0.75145006]], dtype=float32)
When alternative backends are being used, not all of the AnnData/MuData features can be supported, and many external tools might not work as expected as they anticipate NumPy/Pandas objects instead.
[25]:
# Clean up
mdata.clear_cache()
mdata.close()
del mdata, rna_pca, obs
Partial writing¶
[!NOTE] This feature is experimental.
While the main use of the shadows is to provide a low-memory read-only solution to scverse datasets, ability to add new embeddings or other items to the file can greatly extend its usage patterns.
[9]:
mdata = MuDataShadow(file, mode="r")
Add a new embedding to the in-memory object:
[10]:
mdata["rna"].obsm["X_pca_copy"] = mdata["rna"].obsm["X_pca"].copy()
mdata["rna"].obsm
[10]:
obsm: X_pcaᐁ, X_pca_copyᐁ, X_umap
For this, a family of methods is useful, including .reopen()
and .write()
. The .write()
method will only work if the connection is not read-only, e.g. 'r+'
, however it is possible to reopen the file in another mode.
Internally, .write()
pushes (._push_changes()
) the in-memory changes (marked with ▲ in the object representation above) to the file and provides meaningful error messages when the file is not open for writing.
This separation of concern makes it transparent when the data is modified, and this workflow can be recommended when barely any data are added to the file. As the methods return the shadow itself, it is possible to chain them:
[11]:
mdata.reopen(mode='r+').write(clear_cache=True).reopen(mode='r'); # clear pushed elements from cache
mdata["rna"].obsm
[11]:
obsm: X_pcaᐁ, X_pca_copyᐁ, X_umap
[12]:
mdata.file.mode
[12]:
'r'
[13]:
mdata.clear_cache()
Default mode is read-only, and it protects the files from being modified while also allowing for multiple connections to the file:
[17]:
try:
mdata.write()
except OSError as e:
print("Not available for .write():", e)
Not available for .write(): File is open in read-only mode. Changes can't be pushed. Reopen it with .reopen('r+') to enable writing.
[!NOTE] Partial writing is currently intended to add new elements to the dataset on disk (e.g. a new embedding to .obsm) rather than to modify the dataset and delete or alter existing elements.
Views¶
Views for shadow objects are conceptually similar to views in AnnData/MuData: they provide a view into an existing object without creating its copy.
As shadow objects inherently operate on the file they are connected to, their views behave slightly differently. Creating a view creates a new connection to the file and returns a new shadow object, which is aware of the part of the data (e.g. which cells) it is supposed to provide a view for.
[18]:
monocytes = mdata['rna'].obs['celltype'].values == "CD14 mono"
monocytes_view = mdata[monocytes]
monocytes_view
[18]:
View of MuData Shadow object with n_obs × n_vars = 612 × 17838 (original 3891 × 17838)
obs: _index, leiden, leiden_wnn, louvain
var: _index, feature_types, gene_ids, highly_variable
obsm: X_mofa, X_mofa_umap, X_umap, X_wnn_umap, prot, rna
varm: LFs, prot, rna
obsp: connectivities, distances, wnn_connectivities, wnn_distances
uns: leiden, leiden_wnn_colors, louvain, neighbors, rna:celltype_colors, umap, wnn
obsmap: prot, rna
varmap: prot, rna
mod: 2 modalities
prot: 612 x 32
X
layers: counts
obs: _index
var: _index, feature_types, gene_ids, highly_variable
obsm: X_pca, X_umap
varm: PCs
obsp: connectivities, distances
uns: neighbors, pca, umap
rna: 612 x 17806
X
raw: X, var, varm
obs: _index, celltype, leiden, n_genes_by_counts, pct_counts_mt, total_counts, total_counts_mt
var: _index, dispersions, dispersions_norm, feature_types, gene_ids, highly_variable, mean, mean_counts, means, mt, n_cells_by_counts, pct_dropout_by_counts, std, total_counts
obsm: X_pca, X_pca_copy, X_umap
varm: PCs
obsp: connectivities, distances
uns: celltype_colors, hvg, leiden, leiden_colors, neighbors, pca, rank_genes_groups, umap
Individual modalities of a MuData Shadow View are sliced accordingly:
[19]:
monocytes_view['rna'].obsm["X_pca"].shape
[19]:
(612, 50)
[20]:
monocytes_view['rna'].obsm
[20]:
obsm: X_pcaᐁ, X_pca_copy, X_umap
Cache is specific to each view:
[21]:
mdata['rna'].obsm # X_pca is not cached
[21]:
obsm: X_pca, X_pca_copy, X_umap
Moreover, this semantic allows to create views of views of views…
[22]:
adata = AnnDataShadow(file / "mod/rna")
[23]:
view = adata[3:10,:30]
view
[23]:
View of AnnData Shadow object with n_obs × n_vars = 7 × 30 (original 3891 × 17806)
X
raw: X, var, varm
obs: _index, celltype, leiden, n_genes_by_counts, pct_counts_mt, total_counts, total_counts_mt
var: _index, dispersions, dispersions_norm, feature_types, gene_ids, highly_variable, mean, mean_counts, means, mt, n_cells_by_counts, pct_dropout_by_counts, std, total_counts
obsm: X_pca, X_pca_copy, X_umap
varm: PCs
obsp: connectivities, distances
uns: celltype_colors, hvg, leiden, leiden_colors, neighbors, pca, rank_genes_groups, umap
[24]:
nested_view = view[:2,-3:]
nested_view
[24]:
View of AnnData Shadow object with n_obs × n_vars = 2 × 3 (original 3891 × 17806)
X
raw: X, var, varm
obs: _index, celltype, leiden, n_genes_by_counts, pct_counts_mt, total_counts, total_counts_mt
var: _index, dispersions, dispersions_norm, feature_types, gene_ids, highly_variable, mean, mean_counts, means, mt, n_cells_by_counts, pct_dropout_by_counts, std, total_counts
obsm: X_pca, X_pca_copy, X_umap
varm: PCs
obsp: connectivities, distances
uns: celltype_colors, hvg, leiden, leiden_colors, neighbors, pca, rank_genes_groups, umap
Getting attributes from views is no different than for shadow objects:
[25]:
nested_view.obs
[25]:
n_genes_by_counts | total_counts | total_counts_mt | pct_counts_mt | leiden | celltype | |
---|---|---|---|---|---|---|
AAACCCATCGTGCATA-1 | 1908 | 6704.0 | 426.0 | 6.354415 | 2 | CD4+ memory T |
AAACGAAAGACAAGCC-1 | 1589 | 3900.0 | 363.0 | 9.307693 | 1 | CD14 mono |
… as they are shadow objects themselves:
[26]:
type(nested_view)
[26]:
shadows.anndatashadow.AnnDataShadow
[27]:
# Clean up
nested_view.close()
view.close()
del nested_view, view
monocytes_view.close()
mdata.close()
del monocytes_view, mdata
Per-feature access to datasets on disk¶
This is currently not possible as caching works at the level of individual HDF5 datasets.
Views may read only the necessary parts of the arrays to memory however this behaviour is currently not universal.
E.g.:
[28]:
adata_subset = adata[:10,:100]
adata_subset.X.shape
[28]:
(10, 100)
[29]:
adata_subset
[29]:
View of AnnData Shadow object with n_obs × n_vars = 10 × 100 (original 3891 × 17806)
X ᐁ
raw: X, var, varm
obs: _index, celltype, leiden, n_genes_by_counts, pct_counts_mt, total_counts, total_counts_mt
var: _index, dispersions, dispersions_norm, feature_types, gene_ids, highly_variable, mean, mean_counts, means, mt, n_cells_by_counts, pct_dropout_by_counts, std, total_counts
obsm: X_pca, X_pca_copy, X_umap
varm: PCs
obsp: connectivities, distances
uns: celltype_colors, hvg, leiden, leiden_colors, neighbors, pca, rank_genes_groups, umap
[30]:
# Clean up
adata.close()
adata_subset.close()
del adata, adata_subset
In order to return the data to its original state, let’s manually remove the items we wrote to the file:
[31]:
import h5py
f = h5py.File(file, "a")
# ^
# ____________|
# if this works,
# no dangling read-only connections!
#
del f["mod/rna/obsm/X_pca_copy"]
f.close()