Downstream: DGEA and GSEA#

Differential gene expression (DGE) analysis is a crucial tool for identifying genes that are significantly over- or underexpressed between different conditions (e.g., healthy vs. disease) within specific cell types. While many tools exist for DGE analysis, recent studies suggest that pseudobulk methods (which aggregate cell-type-specific expression values per individual) perform particularly well for single-cell data, helping to avoid issues like pseudoreplication and inflated false discovery rates. In this tutorial, we’ll show how to perform both, primarily using the packages decoupleR and pybiomart for gene symbol conversion. If you haven’t yet installed it, you can do so with

pip install decoupler pybiomart
import scanpy as sc
import pandas as pd
import numpy as np
import decoupler as dc
import matplotlib.pyplot as plt

# for differential expression analysis
from pydeseq2.dds import DeseqDataSet, DefaultInference
from pydeseq2.ds import DeseqStats
import adjustText

sc.settings.njobs = 1

Understanding Pseudobulk Analysis#

When working with single-cell data across multiple conditions, performing differential expression at the single-cell level can lead to inflated p-values because each cell is treated as an independent sample. However, cells from the same sample are not truly independent since they share the same environment. Additionally, uneven cell numbers between samples can bias results.

The pseudobulk approach addresses these issues by:

  1. Aggregating cells from the same sample and cell type

  2. Working with raw counts rather than normalized data

  3. Requiring multiple biological replicates per condition

  4. Accounting for sample-level variation

Generate Pseudobulks#

We’ll now create pseudobulk profiles by summing counts across cells from the same sample and cell type. This helps recover lowly expressed genes that might be affected by dropout in single-cell analysis and provides a more robust foundation for differential expression testing.

adata_raw = sc.datasets.ebi_expression_atlas("E-MTAB-9221", filter_boring=True)

# Rename meta-data
columns = [
    "Sample Characteristic[sex]",
    "Sample Characteristic[individual]",
    "Sample Characteristic[disease]",
    "Factor Value[inferred cell type - ontology labels]",
]
adata_raw.obs = adata_raw.obs[columns]
adata_raw.obs.columns = ["sex", "individual", "disease", "cell_type"]
adata_raw

adata_raw
AnnData object with n_obs × n_vars = 6807 × 20522
    obs: 'sex', 'individual', 'disease', 'cell_type'
adata_raw.to_df().head()
ENSG00000000003 ENSG00000000419 ENSG00000000457 ENSG00000000460 ENSG00000000938 ENSG00000000971 ENSG00000001036 ENSG00000001084 ENSG00000001167 ENSG00000001460 ... ENSG00000289568 ENSG00000289604 ENSG00000289685 ENSG00000289690 ENSG00000289694 ENSG00000289695 ENSG00000289697 ENSG00000289700 ENSG00000289701 ENSG00000289716
SAMEA6979313-AAACCCAAGACTCAAA 0.0 0.000000 0.0 0.0 2.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0
SAMEA6979313-AAAGAACCACCTGCTT 0.0 0.000000 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
SAMEA6979313-AAAGGATGTCCCTCAT 0.0 1.082781 0.0 0.0 9.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
SAMEA6979313-AAAGGGTGTCCCTCAT 0.0 0.000000 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
SAMEA6979313-AAAGGTTGTCCCTCAT 0.0 0.000000 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 20522 columns

Preprocessing#

Aligning gene identifiers#

One common challenge in bioinformatics is dealing with different gene identifier systems. We’ll use ENSEBML IDs for the following steps since these are usually more precise than gene symbols.

annot = sc.queries.biomart_annotations(
    "hsapiens", ["ensembl_gene_id", "external_gene_name"], use_cache=False
).set_index("ensembl_gene_id")
annot.head()
external_gene_name
ensembl_gene_id
ENSG00000210049 MT-TF
ENSG00000211459 MT-RNR1
ENSG00000210077 MT-TV
ENSG00000210082 MT-RNR2
ENSG00000209082 MT-TL1
# Filter genes not in annotation
adata = adata_raw[:, adata_raw.var.index.intersection(annot.index.values)]

# Assign gene symbols
adata.var["gene_symbol"] = [
    annot.loc[ensembl_id, "external_gene_name"] for ensembl_id in adata.var.index
]
adata.var = (
    adata.var.reset_index()
    .rename(columns={"index": "ensembl_gene_id"})
    .set_index("gene_symbol")
)

# Remove genes with no gene symbol
adata = adata[:, ~pd.isnull(adata.var.index)]

# Remove duplicate genes
adata.var_names_make_unique()

adata
/var/folders/qg/qgc908995g3fc8qtss2fsbhhxyxxj4/T/ipykernel_69686/339012641.py:6: ImplicitModificationWarning: Trying to modify attribute `.var` of view, initializing view as actual.
  adata.var["gene_symbol"] = [
/Users/tim.treis/anaconda3/envs/spatialdata/envs/workshop_2025/lib/python3.10/site-packages/anndata/_core/anndata.py:1758: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
AnnData object with n_obs × n_vars = 6807 × 19123
    obs: 'sex', 'individual', 'disease', 'cell_type'
    var: 'ensembl_gene_id'
# let's remove cells without an annotated celltype for easier downstream analysis
adata = adata[~adata.obs["cell_type"].isnull()]

We will need the raw counts for subsequent pseudo-bulk analysis, so let’s save them in another AnnData layer.

adata.X = np.round(adata.X)
adata.layers["counts"] = adata.X

# Normalize and log-transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.layers["normalized"] = adata.X
/var/folders/qg/qgc908995g3fc8qtss2fsbhhxyxxj4/T/ipykernel_69686/2867543197.py:1: ImplicitModificationWarning: Modifying `X` on a view results in data being overridden
  adata.X = np.round(adata.X)
/var/folders/qg/qgc908995g3fc8qtss2fsbhhxyxxj4/T/ipykernel_69686/2867543197.py:2: ImplicitModificationWarning: Setting element `.layers['counts']` of view, initializing view as actual.
  adata.layers["counts"] = adata.X
# Identify highly variable genes
sc.pp.highly_variable_genes(
    adata, flavor="seurat_v3", n_top_genes=2000, batch_key="individual"
)

# Scale the data
sc.pp.scale(adata, max_value=10)

# Generate PCA features
sc.tl.pca(adata, svd_solver="arpack", use_highly_variable=True)

# Compute distances in the PCA space, and find cell neighbors
sc.pp.neighbors(adata)

# Generate UMAP features
sc.tl.umap(adata)

# Visualize
sc.pl.umap(adata, color=["disease", "cell_type", "individual"], frameon=False)
/Users/tim.treis/anaconda3/envs/spatialdata/envs/workshop_2025/lib/python3.10/site-packages/scanpy/preprocessing/_highly_variable_genes.py:73: UserWarning: `flavor='seurat_v3'` expects raw count data, but non-integers were found.
  warnings.warn(
../_images/6935bcda1485a1e850de20bfb33c6767775002c50bbcea985177173e375e91e7.png

We can see that we have fairly good mixing of disease conditions (COVID-19/normal) and individuals in our data while the cell types are mostly separated.

Assess the quality of our pseudo-bulks#

When generating pseudobulk profiles, we need to ensure that each aggregated sample has sufficient data for reliable analysis. Samples with too few cells or low total counts may introduce noise and bias into downstream analyses. While the specific thresholds depend on the dataset, requiring at least 10 cells and 1000 total counts per sample helps maintain statistical power while filtering out potentially unreliable measurements.

pdata = dc.get_pseudobulk(
    adata,
    sample_col="individual",  # so we have multiple samples
    groups_col="cell_type",  # defines contrasts
    layer="counts",  # use normalized counts
    mode="sum",  # sum up counts for each group
    min_cells=0,
    min_counts=0,
)
_, ax = plt.subplots(figsize=(4, 4))

ax.hlines(y=3, xmin=0, xmax=3, color="red", linewidth=2, linestyle="--")
ax.vlines(x=1, ymin=3, ymax=7, color="red", linewidth=2, linestyle="--")

dc.plot_psbulk_samples(pdata, groupby="individual", figsize=(12, 4), ax=ax)
../_images/5bd7a8d903a53ef5f2fa30b5499a786d716dc7bc8ca3de66dd16b3093d41eaa8.png
_, ax = plt.subplots(figsize=(4, 4))

ax.hlines(y=3, xmin=0, xmax=3, color="red", linewidth=2, linestyle="--")
ax.vlines(x=1, ymin=3, ymax=7, color="red", linewidth=2, linestyle="--")

dc.plot_psbulk_samples(pdata, groupby="cell_type", figsize=(12, 4), ax=ax)
../_images/f24c192a98730f7983f4d3963b390f05a1cc42d448e9dd239f7f3bbcf1f24866.png

We’ll filter out celltype-individual pseudo-bulks that don’t contain at least 10 cells and 1000 total counts.

pdata = dc.get_pseudobulk(
    adata,
    sample_col="individual",  # so we have multiple samples
    groups_col="cell_type",  # defines contrasts
    layer="counts",  # use normalized counts
    mode="sum",  # sum up counts for each group
    min_cells=10,
    min_counts=1000,
)
dc.plot_psbulk_samples(pdata, groupby=["individual", "cell_type"], figsize=(12, 4))
../_images/4f4980ff036b16c4f2b58aa57fd4d60e02513d8ba3adad39b130d4003b7b46ba.png

We can see that we now end up with 34 observations which represent the filtered individual_celltype combinations.

pdata
AnnData object with n_obs × n_vars = 34 × 18463
    obs: 'individual', 'cell_type', 'sex', 'disease', 'psbulk_n_cells', 'psbulk_counts'
    var: 'ensembl_gene_id', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches', 'mean', 'std'
    uns: 'log1p', 'pca'
    obsm: 'X_pca'
    varm: 'PCs'
    layers: 'psbulk_props', 'counts'
pdata.to_df().head(8)
gene_symbol 5S_rRNA A1BG A2M A2MP1 A4GALT AAAS AACS AAGAB AAK1 AAMDC ... ZWILCH ZWINT ZXDA ZXDB ZXDC ZYG11A ZYG11B ZYX ZZEF1 ZZZ3
Control #1_B cell 0.0 3.0 0.0 0.0 1.0 1.0 0.0 0.0 4.0 1.0 ... 0.0 0.0 1.0 0.0 3.0 0.0 1.0 1.0 28.0 0.0
Control #2_B cell 0.0 7.0 0.0 0.0 0.0 0.0 2.0 3.0 9.0 2.0 ... 1.0 0.0 3.0 1.0 2.0 1.0 8.0 2.0 60.0 1.0
Control #3_B cell 0.0 6.0 0.0 0.0 0.0 0.0 0.0 3.0 8.0 1.0 ... 0.0 0.0 2.0 0.0 2.0 0.0 2.0 0.0 25.0 1.0
SARS-CoV2 pos Mild_B cell 0.0 17.0 0.0 0.0 1.0 7.0 4.0 1.0 31.0 3.0 ... 2.0 0.0 1.0 3.0 6.0 4.0 34.0 3.0 201.0 14.0
SARS-CoV2 pos Severe #1_B cell 0.0 20.0 0.0 0.0 0.0 2.0 15.0 14.0 25.0 2.0 ... 5.0 0.0 6.0 4.0 9.0 0.0 52.0 8.0 275.0 17.0
SARS-CoV2 pos Severe #2_B cell 0.0 28.0 0.0 0.0 0.0 4.0 6.0 11.0 15.0 6.0 ... 1.0 1.0 5.0 8.0 14.0 1.0 27.0 8.0 211.0 25.0
SARS-CoV2 pos Severe #3_B cell 0.0 4.0 0.0 0.0 1.0 1.0 2.0 3.0 3.0 0.0 ... 0.0 0.0 1.0 1.0 3.0 0.0 8.0 0.0 49.0 0.0
Control #1_T cell 1.0 70.0 44.0 5.0 5.0 11.0 28.0 33.0 704.0 14.0 ... 14.0 4.0 12.0 14.0 38.0 3.0 99.0 58.0 822.0 45.0

8 rows × 18463 columns

# let's save the raw counts for later use
pdata.layers["counts"] = pdata.X.copy()

# Normalize, scale and compute pca
sc.pp.normalize_total(pdata, target_sum=1e4)
sc.pp.log1p(pdata)
sc.pp.scale(pdata, max_value=10)
sc.tl.pca(pdata)

# Return raw counts to X
dc.swap_layer(pdata, "counts", X_layer_key=None, inplace=True)
sc.pl.pca(pdata, color=["disease", "cell_type"], size=300)
../_images/fe4e260cf45c1eb529ca94998f6523aae9363e73b79ab8b79f7e2d190417b3a4.png

We see in the PCA that our data seems to cluster well by celltype whereas conditions are mixed. In the plot below we see that PC1 and PC2 explain the majority of the variance which explains why the above plots are separated so well.

sc.pl.pca_variance_ratio(pdata)
../_images/86861a8941ab29a22666e52df485e4345792e476e24ef723e0aff9decb931a5a.png

Denoising the pseudo-bulks#

After filtering low-quality samples, we need to ensure that the genes we analyze are reliably expressed. Since different cell types express distinct sets of genes, we perform this filtering separately for each cell type.

In the next steps, we’ll focus on T cells.

tcells = pdata[pdata.obs["cell_type"] == "T cell"].copy()

We expect a bimodal distribution in these genes with one cluster of genes being expressed very little and only by very few samples. We will adjust the min_count and min_total_count parameters to filter these out.

dc.plot_filter_by_expr(tcells, group="disease", min_count=1, min_total_count=1)
../_images/656fb4e0349e3a092945ee49d7e59879c6001f27c3391e6ad18b0620730a3c95.png

Once we’ve found sufficiently good values, we’ll use these to subset the genes in the T cell AnnData object.

genes = dc.filter_by_expr(tcells, group="disease", min_count=10, min_total_count=15)

# Filter by these genes
tcells = tcells[:, genes].copy()
tcells
AnnData object with n_obs × n_vars = 7 × 10393
    obs: 'individual', 'cell_type', 'sex', 'disease', 'psbulk_n_cells', 'psbulk_counts'
    var: 'ensembl_gene_id', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches', 'mean', 'std'
    uns: 'log1p', 'pca'
    obsm: 'X_pca'
    varm: 'PCs'
    layers: 'psbulk_props', 'counts'

Differential expression analysis of the pseudo-bulks using pydeseq2#

In this dataset, a natural contrast to explore is how the gene expression of T cells differes between healthy patients and patienst suffering from COVID-19. To do so, we’ll use PyDESeq2, a Python implementation of the popular R packages.

More details can be found in the PyDESeq2 documentation

inference = DefaultInference(n_cpus=8)

dds = DeseqDataSet(
    adata=tcells,
    design_factors="disease",
    ref_level=["disease", "normal"],
    refit_cooks=True,
    inference=inference,
)
/var/folders/qg/qgc908995g3fc8qtss2fsbhhxyxxj4/T/ipykernel_69686/4050698110.py:2: DeprecationWarning: ref_level is deprecated and no longer has any effect. It will beremoved in a future release.
  dds = DeseqDataSet(
/var/folders/qg/qgc908995g3fc8qtss2fsbhhxyxxj4/T/ipykernel_69686/4050698110.py:2: DeprecationWarning: design_factors is deprecated and will soon be removed.Please consider providing a formulaic formula using the design argumentinstead.
  dds = DeseqDataSet(
dds.deseq2()
Using None as control genes, passed at DeseqDataSet initialization
Fitting size factors...
... done in 0.00 seconds.

Fitting dispersions...
... done in 0.78 seconds.

Fitting dispersion trend curve...
... done in 0.28 seconds.

Fitting MAP dispersions...
... done in 0.98 seconds.

Fitting LFCs...
... done in 0.71 seconds.

Calculating cook's distance...
... done in 0.01 seconds.

Replacing 0 outlier genes.
stat_res = DeseqStats(
    dds,
    contrast=["disease", "COVID-19", "normal"],
    inference=inference,
)
stat_res.summary()
Running Wald tests...
Log2 fold change & Wald test p-value: disease COVID-19 vs normal
               baseMean  log2FoldChange     lfcSE      stat    pvalue  \
gene_symbol                                                             
A1BG          70.478830       -0.203063  0.259224 -0.783351  0.433421   
A2M           36.737832       -1.260093  0.355132 -3.548244  0.000388   
A2MP1         15.628439        0.601803  0.701386  0.858020  0.390881   
AAAS          18.307198        0.254639  0.453519  0.561474  0.574475   
AACS          24.634494        0.259215  0.383214  0.676423  0.498772   
...                 ...             ...       ...       ...       ...   
ZXDC          30.185259       -0.289675  0.373567 -0.775429  0.438086   
ZYG11B       101.897349        0.269827  0.298309  0.904523  0.365718   
ZYX           81.607418        0.303738  0.264875  1.146723  0.251496   
ZZEF1        820.875555        0.027797  0.224519  0.123809  0.901467   
ZZZ3          59.934519       -0.075166  0.303987 -0.247268  0.804701   

                 padj  
gene_symbol            
A1BG         0.847593  
A2M          0.031502  
A2MP1        0.828224  
AAAS         0.900516  
AACS         0.873693  
...               ...  
ZXDC         0.850122  
ZYG11B       0.813746  
ZYX          0.742821  
ZZEF1        0.983509  
ZZZ3         0.960365  

[10393 rows x 6 columns]
... done in 0.45 seconds.
results_df = stat_res.results_df
results_df
baseMean log2FoldChange lfcSE stat pvalue padj
gene_symbol
A1BG 70.478830 -0.203063 0.259224 -0.783351 0.433421 0.847593
A2M 36.737832 -1.260093 0.355132 -3.548244 0.000388 0.031502
A2MP1 15.628439 0.601803 0.701386 0.858020 0.390881 0.828224
AAAS 18.307198 0.254639 0.453519 0.561474 0.574475 0.900516
AACS 24.634494 0.259215 0.383214 0.676423 0.498772 0.873693
... ... ... ... ... ... ...
ZXDC 30.185259 -0.289675 0.373567 -0.775429 0.438086 0.850122
ZYG11B 101.897349 0.269827 0.298309 0.904523 0.365718 0.813746
ZYX 81.607418 0.303738 0.264875 1.146723 0.251496 0.742821
ZZEF1 820.875555 0.027797 0.224519 0.123809 0.901467 0.983509
ZZZ3 59.934519 -0.075166 0.303987 -0.247268 0.804701 0.960365

10393 rows × 6 columns

Visual inspection of differentially expressed genes#

Volcano plots visualize differential expression results by plotting statistical significance (-log10 p-value) against effect size (log2 fold change). They help quickly identify genes that show both significant and substantial expression changes between conditions, making them a valuable tool for prioritizing genes for further investigation.

Typically, one adds visual guiding lines at 1.3 for the y-axis (-log10(0.05)) and +/- 1 log2(2) foldchange for the x-axis.

Why the adjusted p-value?#

When performing differential expression analysis across thousands of genes, we use adjusted p-values (typically Benjamini-Hochberg correction) to control for multiple testing. This is crucial because testing many genes simultaneously increases the chance of false positives - if we used raw p-values and tested 10,000 genes at p < 0.05, we’d expect about 500 false positives by chance alone. The adjusted p-value (or FDR - False Discovery Rate) helps control this problem by adjusting the significance threshold based on the number of tests performed, giving us more confidence that our findings represent true biological differences rather than statistical artifacts.

dc.plot_volcano_df(results_df, x="log2FoldChange", y="padj", top=20, figsize=(12, 6))
../_images/5b903668de54f767048c7d1520218b6da1d9f61ac76fcdbbd7afecc1e989824a.png

Enrichment with Over Representation Analysis (ORA)#

Once we have identified differentially expressed genes, we can interpret their biological significance through enrichment analysis. This helps us understand which pathways or biological processes are altered between conditions. We’ll use:

  1. Over-Representation Analysis (ORA): Tests whether genes in a pathway appear more frequently than expected by chance in our set of differential genes

  2. MSigDB gene sets: A comprehensive collection of annotated gene sets for various biological processes

  3. Visualization of enrichment results to identify key pathways

This analysis helps translate our gene-level findings into biological insights about pathway regulation and cellular processes.

msigdb_all = dc.get_resource("MSigDB")
msigdb_all
genesymbol collection geneset
0 A1BG immunesigdb GSE25088_CTRL_VS_IL4_AND_ROSIGLITAZONE_STIM_MA...
1 A1BG tf_targets_legacy TGTTTGY_HNF3_Q6
2 A1BG positional chr19q13
3 A1BG cell_type_signatures GAO_LARGE_INTESTINE_ADULT_CI_MESENCHYMAL_CELLS
4 A1BG go_cellular_component GOCC_EXTERNAL_ENCAPSULATING_STRUCTURE
... ... ... ...
5522261 ZZZ3 go_biological_process GOBP_MACROMOLECULE_DEACYLATION
5522262 ZZZ3 go_biological_process GOBP_CELL_CYCLE
5522263 ZZZ3 tf_targets_gtrf ZNF507_TARGET_GENES
5522264 ZZZ3 immunesigdb GSE3982_NEUTROPHIL_VS_EFF_MEMORY_CD4_TCELL_DN
5522265 ZZZ3 immunesigdb GSE18893_CTRL_VS_TNF_TREATED_TCONV_24H_UP

5522266 rows × 3 columns

collection = "go_biological_process"

msigdb = msigdb_all[msigdb_all["collection"] == collection]

# Remove duplicated entries
msigdb = msigdb[~msigdb.duplicated(["geneset", "genesymbol"])]

if collection == "hallmark":
    msigdb.loc[:, "geneset"] = [
        name.split("HALLMARK_")[1] for name in msigdb["geneset"]
    ]
elif collection == "go_biological_process":
    msigdb.loc[:, "geneset"] = [name.split("GOBP_")[1] for name in msigdb["geneset"]]

msigdb
genesymbol collection geneset
73 A1CF go_biological_process MRNA_METABOLIC_PROCESS
78 A1CF go_biological_process EMBRYO_IMPLANTATION
84 A1CF go_biological_process MACROMOLECULE_CATABOLIC_PROCESS
87 A1CF go_biological_process MRNA_MODIFICATION
88 A1CF go_biological_process RNA_PROCESSING
... ... ... ...
5522215 ZZZ3 go_biological_process PROTEIN_ACYLATION
5522223 ZZZ3 go_biological_process HISTONE_H3_ACETYLATION
5522241 ZZZ3 go_biological_process REGULATION_OF_MULTICELLULAR_ORGANISMAL_DEVELOP...
5522261 ZZZ3 go_biological_process MACROMOLECULE_DEACYLATION
5522262 ZZZ3 go_biological_process CELL_CYCLE

1026039 rows × 3 columns

top_genes = results_df[results_df["padj"] < 0.05]

enr_pvals = dc.get_ora_df(
    df=top_genes, net=msigdb, source="geneset", target="genesymbol"
)

enr_pvals.head()
Term Set size Overlap ratio p-value FDR p-value Odds ratio Combined score Features
0 2_OXOGLUTARATE_METABOLIC_PROCESS 26 0.038462 0.189642 0.620302 7.044437 11.712207 L2HGDH
1 3_UTR_MEDIATED_MRNA_DESTABILIZATION 28 0.035714 0.202653 0.628345 6.549434 10.454592 ZC3H12A
2 ACIDIC_AMINO_ACID_TRANSPORT 88 0.022727 0.158084 0.597507 3.526646 6.505362 P2RX7;TNF
3 ACID_SECRETION 66 0.030303 0.098998 0.559738 4.698541 10.866114 P2RX7;TNF
4 ACTIN_FILAMENT_BASED_PROCESS 1274 0.003925 0.978775 0.999998 0.516375 0.011078 ACTG1;CORO1A;OPHN1;STMN1;TNF
dc.plot_dotplot(
    enr_pvals.sort_values("Combined score", ascending=False).head(15),
    x="Combined score",
    y="Term",
    s="Odds ratio",
    c="FDR p-value",
    scale=0.1,
    figsize=(3, 9),
)
../_images/645cf954a76139a6efc996d4d58f25e6028a36b5b4b5129bdac0febf70239fbe.png