{ "cells": [ { "cell_type": "code", "execution_count": 1, "id": "4aa723fb-6a8d-4d43-913c-a31f2316b02f", "metadata": {}, "outputs": [], "source": [ "import os\n", "os.chdir(\"../../\")" ] }, { "cell_type": "code", "execution_count": 2, "id": "f1c3418a-3a90-41b0-baa6-c6ad340dc75f", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "data = Path(\"data/\")" ] }, { "cell_type": "code", "execution_count": null, "id": "182d566e-9678-4185-9d11-1bc7578673e3", "metadata": {}, "outputs": [], "source": [ "# scanpy and muon are only used for plotting below\n", "import scanpy as sc\n", "import muon as mu" ] }, { "cell_type": "markdown", "id": "43da885f-111d-4e07-a330-bd067d7e60b6", "metadata": {}, "source": [ "Throughout the notebook we will also track how much memory the notebook consumes after loading different objects into memory.\n", "\n", "This won't provide exact measurements but will show the order of magnitude of memory consumption:" ] }, { "cell_type": "code", "execution_count": 4, "id": "2f08adba-2d59-47f9-a8f0-aee2f31f1ed7", "metadata": {}, "outputs": [], "source": [ "import os, psutil\n", "\n", "# Resident set size (RSS) of the current process, in MiB\n", "def measure_memory() -> float:\n", "    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2" ] }, { "cell_type": "markdown", "id": "697198ed-b0ea-4ad1-a651-b2d6cc884806", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "id": "934b8d69-b812-422f-b718-080bb8508348", "metadata": {}, "source": [ "## Shadow objects" ] }, { "cell_type": "markdown", "id": "e70decb3-d905-4903-8e06-3e581cd7b0e5", "metadata": {}, "source": [ "Many exploratory and downstream analyses and visualisations only require read-only access to the data.\n", "\n", "While AnnData (and MuData) [provides](https://anndata.readthedocs.io/en/latest/generated/anndata.read_h5ad.html) support for delaying the loading of the count matrix `.X` into memory, it currently does not provide lightweight read-only access to the file contents.\n", "\n", "This is addressed by the new classes 
implemented in `shadows`: *AnnDataShadow* and *MuDataShadow*.\n", "Briefly, they mimic the AnnData and MuData interfaces while keeping a connection to the file open and loading the respective arrays, matrices and tables only when they are requested.\n", "\n", "Shadow objects are currently only implemented for H5AD and H5MU files." ] }, { "cell_type": "markdown", "id": "65462d07-01b0-4395-8891-eda01e472f38", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "id": "4a38075c-8da2-4193-af1a-c52e18176f92", "metadata": {}, "source": [ "Import the classes for these shadow objects:" ] }, { "cell_type": "code", "execution_count": 5, "id": "079454ed-10dc-47ef-9de2-ef70f95dbed6", "metadata": { "scrolled": true }, "outputs": [], "source": [ "from shadows import AnnDataShadow, MuDataShadow" ] }, { "cell_type": "markdown", "id": "67b6edb8-0b0f-4ae4-82f2-9f3e48640b65", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "id": "c3221fae-4a1f-4b53-a680-200de544078b", "metadata": {}, "source": [ "First, let's download a multimodal dataset in the H5MU format:" ] }, { "cell_type": "code", "execution_count": 6, "id": "7397ecfb-1017-4ff0-a2f1-36d2d44b1491", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "■ File pbmc5k_citeseq_processed.h5mu from pbmc5k_citeseq has been found at data/pbmc5k_citeseq/pbmc5k_citeseq_processed.h5mu\n", "■ Checksum is validated (md5) for pbmc5k_citeseq_processed.h5mu\n", "■ Loading pbmc5k_citeseq_processed.h5mu...\n", "Loading MuData took about 394.68 MiB of memory\n" ] } ], "source": [ "import mudatasets\n", "\n", "# Memory consumption before (in MiB)\n", "mem_before = measure_memory()\n", "\n", "mdata = mudatasets.load(\"pbmc5k_citeseq\", files=[\"pbmc5k_citeseq_processed.h5mu\"], data_dir=data, backed=False)\n", "# This returns a MuData object after downloading the file,\n", "# but we will discard the in-memory object below\n", "\n", "# Memory consumption after (in MiB)\n", "mem_after = 
measure_memory()\n", "\n", "print(f\"Loading MuData took about {(mem_after - mem_before):.2f} MiB of memory\")\n", "\n", "import gc\n", "del mdata\n", "gc.collect();" ] }, { "cell_type": "markdown", "id": "15b92533-14c1-4999-890a-0ae64bed01fc", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "id": "b8ae6d73-9a74-48ed-9d41-7e92bfee8f71", "metadata": {}, "source": [ "### Access and memory consumption" ] }, { "cell_type": "code", "execution_count": 7, "id": "a81a45da-352f-4d4f-9e44-530b34e6c9f3", "metadata": {}, "outputs": [], "source": [ "mem_start = measure_memory()" ] }, { "cell_type": "markdown", "id": "14f6f1ef-578a-499e-913a-036ac0772574", "metadata": {}, "source": [ "Now we can initialise a new MuData-like (in reality, a MuDataShadow) object:" ] }, { "cell_type": "code", "execution_count": 8, "id": "e5ab4c4b-ff9b-4691-acd4-ee87a3fa38e1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MuDataShadow took about 0.0 MiB of memory\n" ] } ], "source": [ "file = data / \"pbmc5k_citeseq/pbmc5k_citeseq_processed.h5mu\"\n", "\n", "# Memory consumption before (in MiB)\n", "mem_before = measure_memory()\n", "\n", "mdata = MuDataShadow(file)\n", "# ^ This will traverse the file \n", "# but will not load any matrices or data frames\n", "\n", "# Memory consumption after (in MiB)\n", "mem_after = measure_memory()\n", "\n", "print(f\"MuDataShadow took about {(mem_after - mem_before)} MiB of memory\")" ] }, { "cell_type": "code", "execution_count": 9, "id": "07fe84d4-6a52-476b-b008-c3d492c0c81d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "raw:\tX, var, varm\n", "\n", "raw:\tXᐁ, var, varm\n", "\n" ] } ], "source": [ "mdata = MuDataShadow(file)\n", "print(mdata['rna'].raw)\n", "mdata['rna'].raw.X[:,10]\n", "print(mdata['rna'].raw)" ] }, { "cell_type": "code", "execution_count": 10, "id": "58e62721-16f4-4e50-bf52-1b159c5dffbc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ 
"MuData Shadow object with n_obs × n_vars = 3891 × 17838\n", " obs:\t_index, leiden, leiden_wnn, louvain\n", " var:\t_index, feature_types, gene_ids, highly_variable\n", " obsm:\tX_mofa, X_mofa_umap, X_umap, X_wnn_umap, prot, rna\n", " varm:\tLFs, prot, rna\n", " obsp:\tconnectivities, distances, wnn_connectivities, wnn_distances\n", " uns:\tleiden, leiden_wnn_colors, louvain, neighbors, rna:celltype_colors, umap, wnn\n", " obsmap:\tprot, rna\n", " varmap:\tprot, rna\n", " mod:\t2 modalities\n", " prot: 3891 x 32\n", " X \n", " layers:\tcounts\n", " obs:\t_index\n", " var:\t_index, feature_types, gene_ids, highly_variable\n", " obsm:\tX_pca, X_umap\n", " varm:\tPCs\n", " obsp:\tconnectivities, distances\n", " uns:\tneighbors, pca, umap\n", " rna: 3891 x 17806\n", " X \n", " raw:\tXᐁ, var, varm\n", " obs:\t_index, celltype, leiden, n_genes_by_counts, pct_counts_mt, total_counts, total_counts_mt\n", " var:\t_index, dispersions, dispersions_norm, feature_types, gene_ids, highly_variable, mean, mean_counts, means, mt, n_cells_by_counts, pct_dropout_by_counts, std, total_counts\n", " obsm:\tX_pca, X_umap\n", " varm:\tPCs\n", " obsp:\tconnectivities, distances\n", " uns:\tcelltype_colors, hvg, leiden, leiden_colors, neighbors, pca, rank_genes_groups, umap" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mdata" ] }, { "cell_type": "markdown", "id": "0933f525-9360-423f-a538-86d7121585f4", "metadata": {}, "source": [ "Individual modalities are `AnnDataShadow` objects:" ] }, { "cell_type": "code", "execution_count": 11, "id": "cdd77b19-bd5a-4dbf-893e-73ed4d2f4616", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "AnnData Shadow object with n_obs × n_vars = 3891 × 17806\n", " X \n", " raw:\tXᐁ, var, varm\n", " obs:\t_index, celltype, leiden, n_genes_by_counts, pct_counts_mt, total_counts, total_counts_mt\n", " var:\t_index, dispersions, dispersions_norm, feature_types, gene_ids, highly_variable, mean, mean_counts, means, 
mt, n_cells_by_counts, pct_dropout_by_counts, std, total_counts\n", " obsm:\tX_pca, X_umap\n", " varm:\tPCs\n", " obsp:\tconnectivities, distances\n", " uns:\tcelltype_colors, hvg, leiden, leiden_colors, neighbors, pca, rank_genes_groups, umap" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mdata['rna']" ] }, { "cell_type": "markdown", "id": "bb50af6a-4ee2-4a8f-b022-9b0daa63e81e", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "id": "95330ef5-8434-4a44-a9e2-93105053e0c2", "metadata": {}, "source": [ "### Caching" ] }, { "cell_type": "code", "execution_count": 12, "id": "341a9b70-9e52-4fb8-9845-8c3941c227e1", "metadata": {}, "outputs": [], "source": [ "# Memory consumption before reading some of the attributes\n", "mem_before = measure_memory()" ] }, { "cell_type": "markdown", "id": "3f0054f4-5099-4f21-874d-2da7dfeacad9", "metadata": {}, "source": [ "We can use the values of a shadow object just like those of a regular MuData. They are loaded from the file when first requested and then cached:" ] }, { "cell_type": "code", "execution_count": 13, "id": "7ce95cda-84ae-437b-80c4-24f70f3fb6d4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | n_genes_by_counts | total_counts | total_counts_mt | pct_counts_mt | leiden | celltype |\n", 
"|---|---|---|---|---|---|---|\n", 
"| AAACCCAAGAGACAAG-1 | 2363 | 7375.0 | 467.0 | 6.332204 | 3 | intermediate mono |\n", 
"| AAACCCAAGGCCTAGA-1 | 1259 | 3772.0 | 343.0 | 9.093319 | 0 | CD4+ naïve T |\n", 
"| AAACCCAGTCGTGCCA-1 | 1578 | 4902.0 | 646.0 | 13.178295 | 2 | CD4+ memory T |\n", 
"| AAACCCATCGTGCATA-1 | 1908 | 6704.0 | 426.0 | 6.354415 | 2 | CD4+ memory T |\n", 
"| AAACGAAAGACAAGCC-1 | 1589 | 3900.0 | 363.0 | 9.307693 | 1 | CD14 mono |\n", 
"| ... | ... | ... | ... | ... | ... | ... |\n", 
"| TTTGGTTGTACGAGTG-1 | 1450 | 5666.0 | 367.0 | 6.477232 | 0 | CD4+ naïve T |\n", 
"| TTTGTTGAGTTAACAG-1 | 3068 | 10213.0 | 896.0 | 8.773132 | 9 | intermediate mono |\n", 
"| TTTGTTGCAGCACAAG-1 | 1649 | 4754.0 | 468.0 | 9.844342 | 4 | CD8+ memory T |\n", 
"| TTTGTTGCAGTCTTCC-1 | 1901 | 6373.0 | 553.0 | 8.677233 | 2 | CD4+ memory T |\n", 
"| TTTGTTGCATTGCCGG-1 | 3443 | 12220.0 | 1287.0 | 10.531915 | 3 | intermediate mono |\n", 
"\n", 
"3891 rows × 6 columns\n", "