[1]:
import os
os.chdir("../../")
[2]:
from pathlib import Path
data = Path("data/")
[ ]:
# We will only use them for plotting below
import scanpy as sc
import muon as mu
Throughout the notebook we will also track how much memory the notebook consumes after loading different objects into memory.
This won’t provide exact measurement but will allow to see the order of magnitude of memory consumption:
[4]:
import os, psutil
def measure_memory() -> float:
return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2
Shadow objects¶
A lot of exploratory and downstream analyses and visualisations require read-only access to the data.
While AnnData (and MuData) provides support to delay reading the count matrix .X
into memory, it currently does not provide a lightweight read-only access to the file contents.
This is addressed by the new classes implemented in shadows
— AnnDataShadow and MuDataShadow. Briefly, they mimic AnnData and MuData interfaces while keeping a connection to the file open and loading respective arrays, matrices and tables only when they are requested.
Shadow objects are currently only implemented for H5AD and H5MU files.
Import classes for these shadow objects:
[5]:
from shadows import AnnDataShadow, MuDataShadow
First, let’s download a multimodal dataset in the H5MU format:
[6]:
import mudatasets
# Memory consumption before (in MiB)
mem_before = measure_memory()
mdata = mudatasets.load("pbmc5k_citeseq", files=["pbmc5k_citeseq_processed.h5mu"], data_dir=data, backed=False)
# This will return a MuData objects after downloading the file,
# but we will discard the in-memory object
# Memory consumption after (in MiB)
mem_after = measure_memory()
print(f"Loading MuData took about {(mem_after - mem_before):.2f} MiB of memory")
import gc
del mdata
gc.collect();
■ File pbmc5k_citeseq_processed.h5mu from pbmc5k_citeseq has been found at data/pbmc5k_citeseq/pbmc5k_citeseq_processed.h5mu
■ Checksum is validated (md5) for pbmc5k_citeseq_processed.h5mu
■ Loading pbmc5k_citeseq_processed.h5mu...
Loading MuData took about 394.68 MiB of memory
Access and memory consumption¶
[7]:
mem_start = measure_memory()
Now we can initialise a new MuData-like (in reality, a MuDataShadow) object:
[8]:
file = data / "pbmc5k_citeseq/pbmc5k_citeseq_processed.h5mu"
# Memory consumption before (in MiB)
mem_before = measure_memory()
mdata = MuDataShadow(file)
# ^ This will traverse the file
# but will not load any matrices or data frames
# Memory consumption after (in MiB)
mem_after = measure_memory()
print(f"MuDataShadow took about {(mem_after - mem_before)} MiB of memory")
MuDataShadow took about 0.0 MiB of memory
[9]:
mdata = MuDataShadow(file)
print(mdata['rna'].raw)
mdata['rna'].raw.X[:,10]
print(mdata['rna'].raw)
raw: X, var, varm
raw: Xᐁ, var, varm
[10]:
mdata
[10]:
MuData Shadow object with n_obs × n_vars = 3891 × 17838
obs: _index, leiden, leiden_wnn, louvain
var: _index, feature_types, gene_ids, highly_variable
obsm: X_mofa, X_mofa_umap, X_umap, X_wnn_umap, prot, rna
varm: LFs, prot, rna
obsp: connectivities, distances, wnn_connectivities, wnn_distances
uns: leiden, leiden_wnn_colors, louvain, neighbors, rna:celltype_colors, umap, wnn
obsmap: prot, rna
varmap: prot, rna
mod: 2 modalities
prot: 3891 x 32
X
layers: counts
obs: _index
var: _index, feature_types, gene_ids, highly_variable
obsm: X_pca, X_umap
varm: PCs
obsp: connectivities, distances
uns: neighbors, pca, umap
rna: 3891 x 17806
X
raw: Xᐁ, var, varm
obs: _index, celltype, leiden, n_genes_by_counts, pct_counts_mt, total_counts, total_counts_mt
var: _index, dispersions, dispersions_norm, feature_types, gene_ids, highly_variable, mean, mean_counts, means, mt, n_cells_by_counts, pct_dropout_by_counts, std, total_counts
obsm: X_pca, X_umap
varm: PCs
obsp: connectivities, distances
uns: celltype_colors, hvg, leiden, leiden_colors, neighbors, pca, rank_genes_groups, umap
Individual modalities are AnnDataShadow
objects:
[11]:
mdata['rna']
[11]:
AnnData Shadow object with n_obs × n_vars = 3891 × 17806
X
raw: Xᐁ, var, varm
obs: _index, celltype, leiden, n_genes_by_counts, pct_counts_mt, total_counts, total_counts_mt
var: _index, dispersions, dispersions_norm, feature_types, gene_ids, highly_variable, mean, mean_counts, means, mt, n_cells_by_counts, pct_dropout_by_counts, std, total_counts
obsm: X_pca, X_umap
varm: PCs
obsp: connectivities, distances
uns: celltype_colors, hvg, leiden, leiden_colors, neighbors, pca, rank_genes_groups, umap
Caching¶
[12]:
# Memory consumption before reading some of the attributes
mem_before = measure_memory()
We can use the values of the shadow object just as values in a regular MuData. They will be loaded from the object and cached:
[13]:
rna_obs = mdata.mod["rna"].obs
rna_obs
[13]:
n_genes_by_counts | total_counts | total_counts_mt | pct_counts_mt | leiden | celltype | |
---|---|---|---|---|---|---|
AAACCCAAGAGACAAG-1 | 2363 | 7375.0 | 467.0 | 6.332204 | 3 | intermediate mono |
AAACCCAAGGCCTAGA-1 | 1259 | 3772.0 | 343.0 | 9.093319 | 0 | CD4+ naïve T |
AAACCCAGTCGTGCCA-1 | 1578 | 4902.0 | 646.0 | 13.178295 | 2 | CD4+ memory T |
AAACCCATCGTGCATA-1 | 1908 | 6704.0 | 426.0 | 6.354415 | 2 | CD4+ memory T |
AAACGAAAGACAAGCC-1 | 1589 | 3900.0 | 363.0 | 9.307693 | 1 | CD14 mono |
... | ... | ... | ... | ... | ... | ... |
TTTGGTTGTACGAGTG-1 | 1450 | 5666.0 | 367.0 | 6.477232 | 0 | CD4+ naïve T |
TTTGTTGAGTTAACAG-1 | 3068 | 10213.0 | 896.0 | 8.773132 | 9 | intermediate mono |
TTTGTTGCAGCACAAG-1 | 1649 | 4754.0 | 468.0 | 9.844342 | 4 | CD8+ memory T |
TTTGTTGCAGTCTTCC-1 | 1901 | 6373.0 | 553.0 | 8.677233 | 2 | CD4+ memory T |
TTTGTTGCATTGCCGG-1 | 3443 | 12220.0 | 1287.0 | 10.531915 | 3 | intermediate mono |
3891 rows × 6 columns
Individual matrices or data frames from .obsm
, .obsp
, .layers
, etc. can be loaded allowing to only load e.g. a particular embedding of the data:
[14]:
rna_pca = mdata["rna"].obsm["X_pca"]
rna_pca
[14]:
array([[ 20.551052 , 0.36840764, -1.6193684 , ..., 0.09656975,
-0.90912175, -0.77955467],
[ -9.47144 , -5.5212517 , -5.107428 , ..., 0.64674896,
-0.892091 , 1.7873902 ],
[ -9.913012 , 2.766899 , -2.0684972 , ..., -0.6454743 ,
1.615869 , -0.63476324],
...,
[ -8.727723 , 7.9196725 , 1.3326805 , ..., 1.4592032 ,
0.91210324, 1.3184382 ],
[-10.792531 , 3.2086673 , -2.0437238 , ..., 1.7311838 ,
-1.840564 , 1.3253008 ],
[ 20.642431 , 0.49294943, -1.6694897 , ..., -0.51208967,
0.60652566, -0.75145006]], dtype=float32)
Notice the symbol ᐁ
near the respective table (/mod/rna/obs
) and matrix (/mod/rna/obsm/X_pca
) indicating that the value has already been loaded into memory:
[15]:
mdata
[15]:
MuData Shadow object with n_obs × n_vars = 3891 × 17838
obs: _index, leiden, leiden_wnn, louvain
var: _index, feature_types, gene_ids, highly_variable
obsm: X_mofa, X_mofa_umap, X_umap, X_wnn_umap, prot, rna
varm: LFs, prot, rna
obsp: connectivities, distances, wnn_connectivities, wnn_distances
uns: leiden, leiden_wnn_colors, louvain, neighbors, rna:celltype_colors, umap, wnn
obsmap: prot, rna
varmap: prot, rna
mod: 2 modalities
prot: 3891 x 32
X
layers: counts
obs: _index
var: _index, feature_types, gene_ids, highly_variable
obsm: X_pca, X_umap
varm: PCs
obsp: connectivities, distances
uns: neighbors, pca, umap
rna: 3891 x 17806
X
raw: Xᐁ, var, varm
obs: _index, celltype, leiden, n_genes_by_counts, pct_counts_mt, total_counts, total_counts_mt
var: _index, dispersions, dispersions_norm, feature_types, gene_ids, highly_variable, mean, mean_counts, means, mt, n_cells_by_counts, pct_dropout_by_counts, std, total_counts
obsm: X_pcaᐁ, X_umap
varm: PCs
obsp: connectivities, distances
uns: celltype_colors, hvg, leiden, leiden_colors, neighbors, pca, rank_genes_groups, umap
Let’s load count matrices for a noticeable effect on memory consumption:
[16]:
rna_x = mdata["rna"].X
prot_x = mdata["prot"].X
[17]:
# Memory consumption after loading .obs, .obsm["X_pca"] for RNA,
# and count matrices for RNA and proteins
mem_after = measure_memory()
[18]:
print(f"Reading these few objects took about {(mem_after - mem_before):.2f} MiB of memory")
Reading these few objects took about 266.30 MiB of memory
Cache can be cleared:
[19]:
mdata.clear_cache()
mdata
[19]:
MuData Shadow object with n_obs × n_vars = 3891 × 17838
obs: _index, leiden, leiden_wnn, louvain
var: _index, feature_types, gene_ids, highly_variable
obsm: X_mofa, X_mofa_umap, X_umap, X_wnn_umap, prot, rna
varm: LFs, prot, rna
obsp: connectivities, distances, wnn_connectivities, wnn_distances
uns: leiden, leiden_wnn_colors, louvain, neighbors, rna:celltype_colors, umap, wnn
obsmap: prot, rna
varmap: prot, rna
mod: 2 modalities
prot: 3891 x 32
X
layers: counts
obs: _index
var: _index, feature_types, gene_ids, highly_variable
obsm: X_pca, X_umap
varm: PCs
obsp: connectivities, distances
uns: neighbors, pca, umap
rna: 3891 x 17806
X
raw: X, var, varm
obs: _index, celltype, leiden, n_genes_by_counts, pct_counts_mt, total_counts, total_counts_mt
var: _index, dispersions, dispersions_norm, feature_types, gene_ids, highly_variable, mean, mean_counts, means, mt, n_cells_by_counts, pct_dropout_by_counts, std, total_counts
obsm: X_pca, X_umap
varm: PCs
obsp: connectivities, distances
uns: celltype_colors, hvg, leiden, leiden_colors, neighbors, pca, rank_genes_groups, umap
(No more obsᐁ
and X_pcaᐁ
!)
[20]:
del rna_obs, rna_pca, rna_x, prot_x
gc.collect()
[20]:
16
[21]:
print(f"After clearing cache, memory consumption change is {(measure_memory() - mem_after)} MiB")
After clearing cache, memory consumption change is -264.33984375 MiB
External tools¶
As AnnDataShadow and MuDataShadow mimic AnnData and MuData interfaces, most of the read-only functionality of the whole ecosystem of scverse tools, including scanpy and muon, is supported out of the box:
[22]:
sc.pl.umap(mdata["rna"], color="celltype")

Notice that the colours were fetched from the respective array in .uns
:
[23]:
mdata['rna'].uns['celltype_colors']
[23]:
array(['#8000ff', '#5641fd', '#2c7ef7', '#00b5eb', '#2adddd', '#54f6cb',
'#80ffb4', '#abf69b', '#d4dd80', '#ffb360', '#ff7e41', '#ff4121',
'#ff0000'], dtype=object)
Moreoever, they were only loaded from file when requested. We can see that an array with these value is cached now:
[24]:
mdata['rna'].uns
[24]:
uns: celltype_colorsᐁ, hvg, leiden, leiden_colors, neighbors, pca, rank_genes_groups, umap
The same works for muon:
[25]:
mdata.obs['celltype'] = mdata['rna'].obs.celltype
mdata.uns['celltype_colors'] = mdata['rna'].uns['celltype_colors']
mu.pl.mofa(mdata, color="celltype")

Adding data to the in-memory object¶
Shadow objects also allow to have new data added to them, for instance new embeddings:
[!NOTE] Data added to in-memory shadow objects can be saved on disk as demonstrated in another tutorial.
[26]:
# For simplicity, we will copy an embedding
mdata.obsm["X_mofa_copy"] = mdata.obsm["X_mofa"].copy()
Notice a new symbol (▲
) to signify an in-memory object:
[27]:
mdata
[27]:
MuData Shadow object with n_obs × n_vars = 3891 × 17838
obs: _index, leiden, leiden_wnn, louvain
var: _index, feature_types, gene_ids, highly_variable
obsm: X_mofaᐁ, X_mofa_umapᐁ, X_umapᐁ, X_wnn_umapᐁ, protᐁ, rnaᐁ, X_mofa_copy▲
varm: LFs, prot, rna
obsp: connectivitiesᐁ, distancesᐁ, wnn_connectivitiesᐁ, wnn_distancesᐁ
uns: leiden, leiden_wnn_colors, louvain, neighbors, rna:celltype_colors, umap, wnn, celltype_colors▲
obsmap: prot, rna
varmap: prot, rna
mod: 2 modalities
prot: 3891 x 32
X
layers: counts
obs: _index
var: _index, feature_types, gene_ids, highly_variable
obsm: X_pca, X_umap
varm: PCs
obsp: connectivities, distances
uns: neighbors, pca, umap
rna: 3891 x 17806
X
raw: X, var, varm
obs: _index, celltype, leiden, n_genes_by_counts, pct_counts_mt, total_counts, total_counts_mt
var: _index, dispersions, dispersions_norm, feature_types, gene_ids, highly_variable, mean, mean_counts, means, mt, n_cells_by_counts, pct_dropout_by_counts, std, total_counts
obsm: X_pca, X_umapᐁ
varm: PCs
obsp: connectivities, distances
uns: celltype_colorsᐁ, hvg, leiden, leiden_colors, neighbors, pca, rank_genes_groups, umap
[!NOTE] Adding columns to data frames or altering objects otherwise is currently not being registered by shadow objects.
Wind down¶
[28]:
mdata.clear_cache()
mdata.close()
mdata
cannot be used after the connection is closed:
[29]:
try:
print(mdata)
except ValueError as e:
print("MuData is not available:", e)
MuData is not available: Invalid location identifier (invalid location identifier)
[30]:
del mdata
gc.collect()
[30]:
9378
[31]:
print(f"In the end, memory consumption is {(measure_memory() - mem_start)} MiB up")
In the end, memory consumption is -1.2421875 MiB up