Shadows features

[1]:
%load_ext autoreload
%autoreload 2
[2]:
import os
os.chdir("../../")
[3]:
from pathlib import Path
data = Path("data/")

Shadow objects and their features

While shadow objects provide a convenient read-only drop-in replacement for AnnData/MuData objects when needed, they also have additional features that can help users make the most of shadows.

Import classes for these shadow objects:

[4]:
from shadows import AnnDataShadow, MuDataShadow

Initialise a multimodal shadow object:

[5]:
file = data / "pbmc5k_citeseq/pbmc5k_citeseq_processed.h5mu"
mdata = MuDataShadow(file)

File

The file connection that the shadow is using can be accessed via the .file attribute:

[6]:
mdata.file
[6]:
<HDF5 file "pbmc5k_citeseq_processed.h5mu" (mode r)>

The name of the file can then be accessed via

[7]:
mdata.file.filename
[7]:
'data/pbmc5k_citeseq/pbmc5k_citeseq_processed.h5mu'

The connection stays open until mdata.close() is called

[8]:
mdata.close()

… or until the file has to be re-opened for modification (see below).

Permissions

We can open HDF5 files in different modes including purely read-only ('r') and read/write ('r+'). The mode can be provided to the constructor:

[9]:
mdata = MuDataShadow(file, mode="r")
mdata.file.mode
[9]:
'r'

Let’s add some data to the in-memory shadow object:

[10]:
mdata["rna"].obsm["X_pca_copy"] = mdata["rna"].obsm["X_pca"].copy()

We can also conveniently close and reopen the connection for a given in-memory shadow object:

[11]:
mdata.reopen(mode="r+")
mdata.file.mode
[11]:
'r+'

This way all the newly added elements are still available in memory:

[12]:
mdata["rna"].obsm
[12]:
obsm:   X_pcaᐁ, X_umap, X_pca_copy▲
[13]:
# Clean up
mdata.close()
del mdata

Individual modality access

Individual modalities stored in the .h5mu files can be accessed as part of the MuDataShadow object:

[14]:
mdata = MuDataShadow(file, mode="r")
mdata["rna"]
[14]:
AnnData Shadow object with n_obs × n_vars = 3891 × 17806
  X
  raw:  X, var, varm
  obs:  _index, celltype, leiden, n_genes_by_counts, pct_counts_mt, total_counts, total_counts_mt
  var:  _index, dispersions, dispersions_norm, feature_types, gene_ids, highly_variable, mean, mean_counts, means, mt, n_cells_by_counts, pct_dropout_by_counts, std, total_counts
  obsm: X_pca, X_umap
  varm: PCs
  obsp: connectivities, distances
  uns:  celltype_colors, hvg, leiden, leiden_colors, neighbors, pca, rank_genes_groups, umap

Moreover, one can also create a direct connection to a specific modality:

[15]:
mdata.close()
del mdata

adata = AnnDataShadow(file / "mod/rna")
adata
[15]:
AnnData Shadow object with n_obs × n_vars = 3891 × 17806
  X
  raw:  X, var, varm
  obs:  _index, celltype, leiden, n_genes_by_counts, pct_counts_mt, total_counts, total_counts_mt
  var:  _index, dispersions, dispersions_norm, feature_types, gene_ids, highly_variable, mean, mean_counts, means, mt, n_cells_by_counts, pct_dropout_by_counts, std, total_counts
  obsm: X_pca, X_umap
  varm: PCs
  obsp: connectivities, distances
  uns:  celltype_colors, hvg, leiden, leiden_colors, neighbors, pca, rank_genes_groups, umap
[16]:
# Clean up
adata.close()
del adata

Class identity

Many tools in the ecosystem including scanpy frequently check if the input object is an AnnData. For instance, in ``sc.pp.highly_variable_genes` <https://github.com/scverse/scanpy/blob/master/scanpy/preprocessing/_highly_variable_genes.py>`__ it reads:

if not isinstance(adata, AnnData):
    raise ValueError(
        '`pp.highly_variable_genes` expects an `AnnData` argument, '
        'pass `inplace=False` if you want to return a `pd.DataFrame`.'
    )

In order for shadow objects to be accepted by such functions, they mock their class identity:

[17]:
mdata = MuDataShadow(file, mode="r")

from mudata import MuData
assert isinstance(mdata, MuData), "mdata is not a valid MuData object"
[18]:
from anndata import AnnData
assert isinstance(mdata["rna"], AnnData), "mdata['rna'] is not a valid AnnData object"

Checking for shadow identity still works:

[19]:
isinstance(mdata, MuDataShadow)
[19]:
True
[20]:
isinstance(mdata["rna"], AnnDataShadow)
[20]:
True
[21]:
mdata.close()

Backends

AnnData/MuData are based on a NumPy/Pandas stack. This is the default for the shadow objects in order to provide compatibility with AnnData/MuData objects.

However the nature of shadow files also simplifies loading individual matrices or tables with alternative backends, e.g. JAX (Array), PyTorch (Tensor) or polars (DataFrame).

[22]:
mdata = MuDataShadow(file, array_backend="jax", table_backend="polars")
[23]:
obs = mdata["rna"].obs
print(type(obs))
obs.head()
<class 'polars.internals.dataframe.frame.DataFrame'>
[23]:
shape: (5, 7)
_index celltype leiden n_genes_by_counts pct_counts_mt total_counts total_counts_mt
object cat cat i32 f32 f32 f32
AAACCCAAGAGACAAG-1 "intermediate m... "3" 2363 6.332204 7375.0 467.0
AAACCCAAGGCCTAGA-1 "CD4+ naïve T" "0" 1259 9.093319 3772.0 343.0
AAACCCAGTCGTGCCA-1 "CD4+ memory T" "2" 1578 13.178295 4902.0 646.0
AAACCCATCGTGCATA-1 "CD4+ memory T" "2" 1908 6.354415 6704.0 426.0
AAACGAAAGACAAGCC-1 "CD14 mono" "1" 1589 9.307693 3900.0 363.0
[24]:
rna_pca = mdata["rna"].obsm["X_pca"]
print(type(rna_pca))
rna_pca
<class 'jaxlib.xla_extension.DeviceArray'>
[24]:
DeviceArray([[ 20.551052  ,   0.36840764,  -1.6193684 , ...,
                0.09656975,  -0.90912175,  -0.77955467],
             [ -9.47144   ,  -5.5212517 ,  -5.107428  , ...,
                0.64674896,  -0.892091  ,   1.7873902 ],
             [ -9.913012  ,   2.766899  ,  -2.0684972 , ...,
               -0.6454743 ,   1.615869  ,  -0.63476324],
             ...,
             [ -8.727723  ,   7.9196725 ,   1.3326805 , ...,
                1.4592032 ,   0.91210324,   1.3184382 ],
             [-10.792531  ,   3.2086673 ,  -2.0437238 , ...,
                1.7311838 ,  -1.840564  ,   1.3253008 ],
             [ 20.642431  ,   0.49294943,  -1.6694897 , ...,
               -0.51208967,   0.60652566,  -0.75145006]], dtype=float32)

When alternative backends are being used, not all of the AnnData/MuData features can be supported, and many external tools might not work as expected as they anticipate NumPy/Pandas objects instead.

[25]:
# Clean up
mdata.clear_cache()
mdata.close()
del mdata, rna_pca, obs

Partial writing

[!NOTE] This feature is experimental.

While the main use of the shadows is to provide a low-memory read-only solution to scverse datasets, ability to add new embeddings or other items to the file can greatly extend its usage patterns.

[9]:
mdata = MuDataShadow(file, mode="r")

Add a new embedding to the in-memory object:

[10]:
mdata["rna"].obsm["X_pca_copy"] = mdata["rna"].obsm["X_pca"].copy()
mdata["rna"].obsm
[10]:
obsm:   X_pcaᐁ, X_pca_copyᐁ, X_umap

For this, a family of methods is useful, including .reopen() and .write(). The .write() method will only work if the connection is not read-only, e.g. 'r+', however it is possible to reopen the file in another mode.

Internally, .write() pushes (._push_changes()) the in-memory changes (marked with ▲ in the object representation above) to the file and provides meaningful error messages when the file is not open for writing.

This separation of concern makes it transparent when the data is modified, and this workflow can be recommended when barely any data are added to the file. As the methods return the shadow itself, it is possible to chain them:

[11]:
mdata.reopen(mode='r+').write(clear_cache=True).reopen(mode='r');  # clear pushed elements from cache
mdata["rna"].obsm
[11]:
obsm:   X_pcaᐁ, X_pca_copyᐁ, X_umap
[12]:
mdata.file.mode
[12]:
'r'
[13]:
mdata.clear_cache()

Default mode is read-only, and it protects the files from being modified while also allowing for multiple connections to the file:

[17]:
try:
    mdata.write()
except OSError as e:
    print("Not available for .write():", e)
Not available for .write(): File is open in read-only mode. Changes can't be pushed. Reopen it with .reopen('r+') to enable writing.

[!NOTE] Partial writing is currently intended to add new elements to the dataset on disk (e.g. a new embedding to .obsm) rather than to modify the dataset and delete or alter existing elements.

Views

Views for shadow objects are conceptually similar to views in AnnData/MuData: they provide a view into an existing object without creating its copy.

As shadow objects inherently operate on the file they are connected to, their views behave slightly differently. Creating a view creates a new connection to the file and returns a new shadow object, which is aware of the part of the data (e.g. which cells) it is supposed to provide a view for.

[18]:
monocytes = mdata['rna'].obs['celltype'].values == "CD14 mono"
monocytes_view = mdata[monocytes]
monocytes_view
[18]:
View of MuData Shadow object with n_obs × n_vars = 612 × 17838 (original 3891 × 17838)
  obs:  _index, leiden, leiden_wnn, louvain
  var:  _index, feature_types, gene_ids, highly_variable
  obsm: X_mofa, X_mofa_umap, X_umap, X_wnn_umap, prot, rna
  varm: LFs, prot, rna
  obsp: connectivities, distances, wnn_connectivities, wnn_distances
  uns:  leiden, leiden_wnn_colors, louvain, neighbors, rna:celltype_colors, umap, wnn
  obsmap:       prot, rna
  varmap:       prot, rna
  mod:  2 modalities
    prot: 612 x 32
        X
        layers: counts
        obs:    _index
        var:    _index, feature_types, gene_ids, highly_variable
        obsm:   X_pca, X_umap
        varm:   PCs
        obsp:   connectivities, distances
        uns:    neighbors, pca, umap
    rna: 612 x 17806
        X
        raw:    X, var, varm
        obs:    _index, celltype, leiden, n_genes_by_counts, pct_counts_mt, total_counts, total_counts_mt
        var:    _index, dispersions, dispersions_norm, feature_types, gene_ids, highly_variable, mean, mean_counts, means, mt, n_cells_by_counts, pct_dropout_by_counts, std, total_counts
        obsm:   X_pca, X_pca_copy, X_umap
        varm:   PCs
        obsp:   connectivities, distances
        uns:    celltype_colors, hvg, leiden, leiden_colors, neighbors, pca, rank_genes_groups, umap

Individual modalities of a MuData Shadow View are sliced accordingly:

[19]:
monocytes_view['rna'].obsm["X_pca"].shape
[19]:
(612, 50)
[20]:
monocytes_view['rna'].obsm
[20]:
obsm:   X_pcaᐁ, X_pca_copy, X_umap

Cache is specific to each view:

[21]:
mdata['rna'].obsm  # X_pca is not cached
[21]:
obsm:   X_pca, X_pca_copy, X_umap

Moreover, this semantic allows to create views of views of views…

[22]:
adata = AnnDataShadow(file / "mod/rna")
[23]:
view = adata[3:10,:30]
view
[23]:
View of AnnData Shadow object with n_obs × n_vars = 7 × 30 (original 3891 × 17806)
  X
  raw:  X, var, varm
  obs:  _index, celltype, leiden, n_genes_by_counts, pct_counts_mt, total_counts, total_counts_mt
  var:  _index, dispersions, dispersions_norm, feature_types, gene_ids, highly_variable, mean, mean_counts, means, mt, n_cells_by_counts, pct_dropout_by_counts, std, total_counts
  obsm: X_pca, X_pca_copy, X_umap
  varm: PCs
  obsp: connectivities, distances
  uns:  celltype_colors, hvg, leiden, leiden_colors, neighbors, pca, rank_genes_groups, umap
[24]:
nested_view = view[:2,-3:]
nested_view
[24]:
View of AnnData Shadow object with n_obs × n_vars = 2 × 3 (original 3891 × 17806)
  X
  raw:  X, var, varm
  obs:  _index, celltype, leiden, n_genes_by_counts, pct_counts_mt, total_counts, total_counts_mt
  var:  _index, dispersions, dispersions_norm, feature_types, gene_ids, highly_variable, mean, mean_counts, means, mt, n_cells_by_counts, pct_dropout_by_counts, std, total_counts
  obsm: X_pca, X_pca_copy, X_umap
  varm: PCs
  obsp: connectivities, distances
  uns:  celltype_colors, hvg, leiden, leiden_colors, neighbors, pca, rank_genes_groups, umap

Getting attributes from views is no different than for shadow objects:

[25]:
nested_view.obs
[25]:
n_genes_by_counts total_counts total_counts_mt pct_counts_mt leiden celltype
AAACCCATCGTGCATA-1 1908 6704.0 426.0 6.354415 2 CD4+ memory T
AAACGAAAGACAAGCC-1 1589 3900.0 363.0 9.307693 1 CD14 mono

… as they are shadow objects themselves:

[26]:
type(nested_view)
[26]:
shadows.anndatashadow.AnnDataShadow
[27]:
# Clean up
nested_view.close()
view.close()
del nested_view, view

monocytes_view.close()
mdata.close()
del monocytes_view, mdata

Per-feature access to datasets on disk

This is currently not possible as caching works at the level of individual HDF5 datasets.

Views may read only the necessary parts of the arrays to memory however this behaviour is currently not universal.

E.g.:

[28]:
adata_subset = adata[:10,:100]
adata_subset.X.shape
[28]:
(10, 100)
[29]:
adata_subset
[29]:
View of AnnData Shadow object with n_obs × n_vars = 10 × 100 (original 3891 × 17806)
  X ᐁ
  raw:  X, var, varm
  obs:  _index, celltype, leiden, n_genes_by_counts, pct_counts_mt, total_counts, total_counts_mt
  var:  _index, dispersions, dispersions_norm, feature_types, gene_ids, highly_variable, mean, mean_counts, means, mt, n_cells_by_counts, pct_dropout_by_counts, std, total_counts
  obsm: X_pca, X_pca_copy, X_umap
  varm: PCs
  obsp: connectivities, distances
  uns:  celltype_colors, hvg, leiden, leiden_colors, neighbors, pca, rank_genes_groups, umap
[30]:
# Clean up
adata.close()
adata_subset.close()
del adata, adata_subset

In order to return the data to its original state, let’s manually remove the items we wrote to the file:

[31]:
import h5py

f = h5py.File(file, "a")
#                    ^
#        ____________|
# if this works,
# no dangling read-only connections!
#

del f["mod/rna/obsm/X_pca_copy"]
f.close()