Shadows for zarr

[1]:
%load_ext autoreload
%autoreload 2
[2]:
import os
os.chdir("../../")
[3]:
from pathlib import Path
data = Path("data/")

Shadows for zarr storage

Beyond H5AD and H5MU files, shadow objects also work with Zarr files.

Import classes for these shadow objects:

[4]:
from shadows import AnnDataShadow, MuDataShadow

Initialise a multimodal shadow object:

[5]:
file = data / "pbmc5k_citeseq/minipbcite_prot.zarr"
adata = AnnDataShadow(file, format="zarr")

File

As with HDF5 files, file connection that the shadow is using can be accessed via the .file attribute:

[7]:
adata.file
[7]:
<zarr.hierarchy.Group '/' read-only>

The path to the file can then be accessed via adata.file.store.path:

[8]:
os.path.basename(adata.file.store.path)
[8]:
'minipbcite_prot.zarr'

Zarr store will be closed upon calling the adata.close() method:

[9]:
adata.close()

… or until the file has to be re-opened for modification (see below).

Permissions

We can open Zarr files in different modes including purely read-only ('r') and read/write ('r+'). The mode can be provided to the constructor:

[10]:
adata = AnnDataShadow(file, format="zarr", mode="r")
adata.file.read_only
[10]:
True

Let’s add some data to the in-memory shadow object:

[11]:
adata.obsm["X_pca_copy"] = adata.obsm["X_pca"].copy()

We can also conveniently close and reopen the connection for a given in-memory shadow object:

[12]:
adata.reopen(mode="r+")
adata.file.read_only
[12]:
False

This way all the newly added elements are still available in memory:

[13]:
adata.obsm
[13]:
obsm:   X_pcaᐁ, X_umap, X_pca_copy▲
[14]:
# Clean up
adata.close()
del adata

Individual modality access

Individual modalities stored in the .h5mu files can be accessed as part of the MuDataShadow object:

[15]:
adata = AnnDataShadow(file, format="zarr")
adata
[15]:
AnnData Shadow object with n_obs × n_vars = 411 × 29
  X
  layers:       counts
  obs:  _index
  var:  _index, feature_types, gene_ids, highly_variable
  obsm: X_pca, X_umap
  varm: PCs
  obsp: connectivities, distances
  uns:  neighbors, pca, umap
[16]:
# Clean up
adata.close()
del adata

Class identity

Many tools in the ecosystem including scanpy frequently check if the input object is an AnnData. For instance, in ``sc.pp.highly_variable_genes` <https://github.com/scverse/scanpy/blob/master/scanpy/preprocessing/_highly_variable_genes.py>`__ it reads:

if not isinstance(adata, AnnData):
    raise ValueError(
        '`pp.highly_variable_genes` expects an `AnnData` argument, '
        'pass `inplace=False` if you want to return a `pd.DataFrame`.'
    )

In order for shadow objects to be accepted by such functions, they mock their class identity:

[17]:
adata = AnnDataShadow(file, format="zarr")

from anndata import AnnData
assert isinstance(adata, AnnData), "adata is not a valid AnnData object"

Checking for shadow identity still works:

[18]:
isinstance(adata, AnnDataShadow)
[18]:
True
[19]:
adata.close()

Backends

AnnData/MuData are based on a NumPy/Pandas stack. This is the default for the shadow objects in order to provide compatibility with AnnData/MuData objects.

However the nature of shadow files also simplifies loading individual matrices or tables with alternative backends, e.g. JAX (Array), PyTorch (Tensor) or polars (DataFrame).

[20]:
adata = AnnDataShadow(file, format="zarr", array_backend="jax", table_backend="polars")
[21]:
obs = adata.obs
print(type(obs))
obs.head()
<class 'polars.internals.dataframe.frame.DataFrame'>
[21]:
shape: (5, 1)
_index
object
CAGCCAGGTCTCGACG-1
TTCTTCCTCTCGGTAA-1
CGGGTCAAGAGAGGTA-1
TACCCGTCATAATCCG-1
TGGGTTAGTGAATTAG-1
[22]:
rna_pca = adata.obsm["X_pca"]
print(type(rna_pca))
rna_pca
<class 'jaxlib.xla_extension.ArrayImpl'>
[22]:
Array([[ 17.051027  ,   1.2865539 ,  -1.2715828 , ...,  -0.05060111,
         -1.8431426 ,  -1.0410113 ],
       [ 15.563506  ,  -2.1941857 ,  -1.351732  , ...,  -1.0639406 ,
         -0.1610156 ,   2.1454387 ],
       [ 20.369316  ,  -8.03503   ,   0.3842825 , ...,   0.52950376,
         -0.38589898,  -0.7488529 ],
       ...,
       [-11.894565  ,   9.380491  ,  -0.87732434, ...,  -0.40848297,
          0.4135897 ,  -0.710097  ],
       [-13.12094   ,   9.734974  ,  -3.345742  , ...,   1.049644  ,
          0.28707528,  -1.8128693 ],
       [-12.875325  ,  11.512296  ,  -4.9828258 , ...,  -0.82176274,
         -2.06324   ,  -0.14073044]], dtype=float32)

When alternative backends are being used, not all of the AnnData/MuData features can be supported, and many external tools might not work as expected as they anticipate NumPy/Pandas objects instead.

[23]:
# Clean up
adata.clear_cache()
adata.close()
del adata, rna_pca, obs

Partial writing

[!NOTE] This feature is experimental.

While the main use of the shadows is to provide a low-memory read-only solution to scverse datasets, ability to add new embeddings or other items to the file can greatly extend its usage patterns.

[24]:
adata = AnnDataShadow(file, format="zarr")

Add a new embedding to the in-memory object:

[25]:
adata.obsm["X_pca_copy"] = adata.obsm["X_pca"].copy()
adata.obsm
[25]:
obsm:   X_pcaᐁ, X_umap, X_pca_copy▲

For this, a family of methods is useful, including .reopen() and .write(). The .write() method will only work if the connection is not read-only, e.g. 'r+', however it is possible to reopen the file in another mode.

Internally, .write() pushes (._push_changes()) the in-memory changes (marked with ▲ in the object representation above) to the file and provides meaningful error messages when the file is not open for writing.

This separation of concern makes it transparent when the data is modified, and this workflow can be recommended when barely any data are added to the file. As the methods return the shadow itself, it is possible to chain them:

[26]:
adata.reopen(mode='r+').write(clear_cache=True).reopen(mode='r');  # clear pushed elements from cache
adata.obsm
[26]:
obsm:   X_pcaᐁ, X_pca_copy, X_umap
[27]:
adata.clear_cache()

Default mode is read-only, and it protects the files from being modified while also allowing for multiple connections to the file:

[28]:
try:
    adata.write()
except OSError as e:
    print("Not available for .write():", e)
Not available for .write(): File is open in read-only mode. Changes can't be pushed. Reopen it with .reopen('r+') to enable writing.

[!NOTE] Partial writing is currently intended to add new elements to the dataset on di not allow to delete or modify existing elements

Views

Views for shadow objects are conceptually similar to views in AnnData/MuData: they provide a view into an existing object without creating its copy.

As shadow objects inherently operate on the file they are connected to, their views behave slightly differently. Creating a view creates a new connection to the file and returns a new shadow object, which is aware of the part of the data (e.g. which cells) it is supposed to provide a view for.

[29]:
head = 100
head_view = adata[0:head]
head_view
[29]:
View of AnnData Shadow object with n_obs × n_vars = 100 × 29 (original 411 × 29)
  X
  layers:       counts
  obs:  _index
  var:  _index, feature_types, gene_ids, highly_variable
  obsm: X_pca, X_pca_copy, X_umap
  varm: PCs
  obsp: connectivities, distances
  uns:  neighbors, pca, umap

Individual modalities of a MuData Shadow View are sliced accordingly:

[30]:
head_view.obsm["X_pca"].shape
[30]:
(100, 31)
[31]:
head_view.obsm
[31]:
obsm:   X_pcaᐁ, X_pca_copy, X_umap
[32]:
nested_view = head_view[:2,-3:]
nested_view
[32]:
View of AnnData Shadow object with n_obs × n_vars = 2 × 3 (original 411 × 29)
  X
  layers:       counts
  obs:  _index
  var:  _index, feature_types, gene_ids, highly_variable
  obsm: X_pca, X_pca_copy, X_umap
  varm: PCs
  obsp: connectivities, distances
  uns:  neighbors, pca, umap

Getting attributes from views is no different than for shadow objects:

[33]:
nested_view.obs
[33]:
CAGCCAGGTCTCGACG-1
TTCTTCCTCTCGGTAA-1

… as they are shadow objects themselves:

[34]:
type(nested_view)
[34]:
shadows.anndatashadow.AnnDataShadow
[35]:
# Clean up
nested_view.close()
del nested_view

head_view.close()
del head_view

Per-feature access to datasets on disk

This is currently not possible as caching works at the level of individual HDF5 datasets.

Views may read only the necessary parts of the arrays to memory however this behaviour is currently not universal.

E.g.:

[36]:
adata_subset = adata[:10,:100]
adata_subset.X.shape
[36]:
(10, 29)
[37]:
adata_subset
[37]:
View of AnnData Shadow object with n_obs × n_vars = 10 × 29 (original 411 × 29)
  X ᐁ
  layers:       counts
  obs:  _index
  var:  _index, feature_types, gene_ids, highly_variable
  obsm: X_pca, X_pca_copy, X_umap
  varm: PCs
  obsp: connectivities, distances
  uns:  neighbors, pca, umap
[38]:
# Clean up
adata.close()
adata_subset.close()
del adata, adata_subset

In order to return the data to its original state, let’s manually remove the items we wrote to the file:

[39]:
import zarr

f = zarr.open(file, "a")
#                    ^
#        ____________|
# if this works,
# no dangling read-only connections!
#

del f["obsm/X_pca_copy"]
f.store.close()