The following is a tutorial on how to perform a GeneSet Enrichment
Analysis (GSEA) rank test on single cell data using Seurat, DESeq2,
and the fGSEA
packages. In a nutshell, we use methods to determine differentially
expressed genes (DEGs) and use their resulting statistical metrics to
rank the genes for a GSEA test.
This workflow details two approaches for generating DEGs for the rank
test. The first uses the FindMarkers() function from
Seurat and is ideal for simple comparisons where no batch
effects are present. The second uses DESeq2 in combination
with aggregation of the cell-based expression to do a pseudobulk RNASeq
differential expression analysis.
Both first require the following steps.
The following are the required packages.
# packages for doing the analysis
library(Seurat) # For single cell analysis
library(SeuratData) # To get the example dataset
library(fgsea) # Fast GSEA implementation
library(msigdbr) # For gene sets
library(DESeq2) # For pseudo-bulk differential analysis
library(dplyr) # Tidy data manipulation
library(ggplot2) # Plotting tools
library(cowplot) # Combine plots
First we load in the ifnb dataset, which was originally
generated as part of this paper and used
as an example in this Seurat
tutorial. In this experiment, PBMCs were split into a stimulated and
control group and the stimulated group was treated with interferon beta.
This dataset is annotated with cells already having their cell types
determined and can be pulled from the SeuratData package.
Ultimately, we are going to see what genesets are being affected when we
compare stimulated to control cells.
# get data for doing analysis
InstallData("ifnb")
LoadData("ifnb") # load the data
## An object of class Seurat
## 14053 features across 13999 samples within 1 assay
## Active assay: RNA (14053 features, 0 variable features)
## 2 layers present: counts, data
# Update old Seurat object to accommodate new features
ifnb <- UpdateSeuratObject(ifnb)
Here we use the msigdbr package to pull in the Hallmark
genesets. Other collections like KEGG/Reactome or GO can be pulled by
adjusting the collection and subcollection
parameters. Here we are pulling the genesets for humans, but mouse data
can be pulled in instead by either setting species to "Mus
musculus" or db_species to "MM". The function
msigdbr_species() lists the available options.
# Get gene sets from MSigDB (e.g., Hallmark pathways)
gene_sets <- msigdbr(species = "Homo sapiens", collection = "H")
# Or use other collections:
# "C2" for curated gene sets (KEGG, Reactome, Wikipathways etc)
# "C5" for GO terms
# "C8" for cell type signatures
# Convert to list format for fgsea
pathways <- gene_sets %>%
split(x = .$gene_symbol, f = .$gs_name)
In situations where we do not have technical or biological replicates
and thus are expecting no batch effects, we can split up the data by
cell type and perform a comparison using the Seurat
function FindMarkers. Ultimately we then have a list of
differentially expressed genes and their corresponding metrics that can
be fed into the GSEA rank test.
In this example, we are focusing specifically on one cell type (CD14
Mono). We then determine which genes show a difference between “STIM”
(stimulated cells) and “CTRL” (control cells) using the wilcox test in
FindMarkers(), although other tests can be used.
Idents(ifnb) <- "seurat_annotations" # ensure that the idents are the cell types
# focus on one cell
ifnb_cd14 <- subset(ifnb, idents = "CD14 Mono")
# label cells with conditions
Idents(ifnb_cd14) <- "stim"
table(Idents(ifnb_cd14))
##
## CTRL STIM
## 2215 2147
# find differentially expressed genes between markers
markers <- FindMarkers(
ifnb_cd14,
ident.1 = "STIM",
ident.2 = "CTRL",
test.use = "wilcox", # or "MAST", "DESeq2"
logfc.threshold = 0, # Get all genes
min.pct = 0.1
)
markers$gene_name <- rownames(markers) # make gene names a column in data frame
head(markers)
## p_val avg_log2FC pct.1 pct.2 p_val_adj gene_name
## IFIT1 0 70.73672 0.985 0.033 0 IFIT1
## CXCL10 0 337.63562 0.984 0.035 0 CXCL10
## RSAD2 0 83.41584 0.988 0.045 0 RSAD2
## TNFSF10 0 82.23645 0.989 0.047 0 TNFSF10
## IFIT3 0 72.10004 0.992 0.056 0 IFIT3
## IFIT2 0 65.41976 0.961 0.039 0 IFIT2
In order to perform the analysis and to give a directionality to the
pathway enrichment, fGSEA requires a value that indicates
both the significance of a gene’s differential expression and whether
the gene is up-regulated or down-regulated. To do this, we take the
-log10 of the p-value and multiply it by the sign of the
log2 fold change. Also, since we have p-values of 0, we are
replacing those values with the smallest non-zero value, to avoid
results of Inf due to the log10 of 0 being
undefined. The resulting stat value is then used to rank
the genes and then passed to fGSEA.
# find minimum non-zero p-value
min_nz_pval <- min(markers$p_val[markers$p_val >0])
# create "stat" value for ranking
markers$stat <- -log10(pmax(markers$p_val, min_nz_pval)) * sign(markers$avg_log2FC)
# Create ranked gene list (by log fold change or stat)
# Important: remove NAs and sort
ranked_genes <- markers %>%
filter(!is.na(stat) &
!is.infinite(stat)) %>%
arrange(desc(stat)) %>%
pull(stat, name = gene_name)
# Run GSEA
fgsea_results <- fgsea(
pathways = pathways,
stats = ranked_genes,
minSize = 15, # Minimal size of a gene set to test
maxSize = 500 # Maximal size of a gene set to test
)
# output results
fgsea_results %>%
arrange(padj) %>%
head(5)
## pathway pval padj log2err
## <char> <num> <num> <num>
## 1: HALLMARK_INTERFERON_GAMMA_RESPONSE 1.110446e-16 4.330741e-15 1.0476265
## 2: HALLMARK_INTERFERON_ALPHA_RESPONSE 1.302071e-11 2.539038e-10 0.8753251
## 3: HALLMARK_OXIDATIVE_PHOSPHORYLATION 1.708996e-03 2.221694e-02 0.4550599
## 4: HALLMARK_APICAL_JUNCTION 1.002038e-02 9.769874e-02 0.3807304
## 5: HALLMARK_IL2_STAT5_SIGNALING 2.715819e-02 2.118339e-01 0.3524879
## ES NES size leadingEdge
## <num> <num> <int> <list>
## 1: 0.8187772 1.939178 139 IFIT1, C....
## 2: 0.8431010 1.915748 79 CXCL10, ....
## 3: -0.5016359 -1.524877 91 ATP6V0B,....
## 4: -0.6778586 -1.710155 30 PFN1, VC....
## 5: 0.6530287 1.426891 53 CXCL10, ....
The results consist of genesets (that pass the given filtering
thresholds) and their corresponding metrics. Important here is the
ES value which shows the “enrichment score”. It reflects
the degree to which a geneset is overrepresented at the top or bottom of
a ranked list of genes (more
info). It also indicates the overall directionality of the geneset
with positive values indicating up-regulation and negative values
indicating down-regulation. NES is a normalized version of
that score, normalized to a mean enrichment of random samples of the
same size (more
info). pval and padj represent the raw
significance of the enrichment and a BH-adjusted version of that
p-value, respectively.
Using a table plot, we can better visualize the top (or whatever set of) genesets. It shows some of the metrics of the analysis, as well as visualizing the ranked gene list, showcasing why the geneset was called as being significant.
# Table plot of top pathways
topPathways <- fgsea_results %>%
filter(padj < 0.05) %>%
arrange(desc(abs(NES))) %>%
pull(pathway)
plotGseaTable(
pathways[topPathways],
ranked_genes,
fgsea_results,
gseaParam = 0.5,
pathwayLabelStyle = list(size=8)
)
Here we see our 3 signficant genesets (at an FDR of 0.05), which includes two up-regulated and one down-regulated pathway. To get more details on the rankings that generated them, we can use …
Through enrichment plots, we can better visualize how the enrichment score was generated by plotting it out in relation to the ranking of genes in a geneset. Here we showcase the difference between an up-regulated and down-regulated geneset.
# Plot enrichment for top pathway
p1 <- plotEnrichment(
pathways[["HALLMARK_INTERFERON_GAMMA_RESPONSE"]],
ranked_genes
) + labs(title = "IFN-γ Response")
p2 <- plotEnrichment(
pathways[["HALLMARK_OXIDATIVE_PHOSPHORYLATION"]],
ranked_genes
) + labs(title = "Oxidative Phosphorylation")
plot_grid(p1, p2, ncol=1)
In situations where we expect the comparison to have to deal with
batch effects, for instance due to biological or even technical
replicates, a pseudobulk analysis using an RNASeq differential
expression tool like DESeq2 is a better option. The
following section walks through doing that kind of analysis.
In order to create a set of “replicates” we will look at the
differential expression between stimulated and control cells across two
cell types: CD14 Mono and CD16 Mono. In the following code we aggregate
the expression for each cell type and pull out the resulting cell types
we want to work with. We further create metadata and use that to start
the process of running DESeq2. After running through that
workflow, and contrasting STIM to CTRL, we have a list of DEGs, ready
for use in fGSEA.
# aggregate cells to create "pseudobulk"
aggregate_ifnb <- AggregateExpression(ifnb, group.by = c("seurat_annotations", "stim"),
return.seurat = TRUE)
Idents(aggregate_ifnb) <- "seurat_annotations"
agg_ifnb_cdmono <- subset(aggregate_ifnb, idents=c("CD14 Mono",
"CD16 Mono"))
# create metadata for samples
sample_metadata <- agg_ifnb_cdmono@meta.data %>%
select(stim, seurat_annotations, orig.ident) %>%
distinct() %>%
as.data.frame()
rownames(sample_metadata) <- sample_metadata$orig.ident
head(sample_metadata)
## stim seurat_annotations orig.ident
## CD14 Mono_CTRL CTRL CD14 Mono CD14 Mono_CTRL
## CD14 Mono_STIM STIM CD14 Mono CD14 Mono_STIM
## CD16 Mono_CTRL CTRL CD16 Mono CD16 Mono_CTRL
## CD16 Mono_STIM STIM CD16 Mono CD16 Mono_STIM
# create DESeq data set
dds <- DESeqDataSetFromMatrix(
countData = agg_ifnb_cdmono[["RNA"]]$counts,
colData = sample_metadata,
design = ~ stim
)
# run DESeq
dds <- DESeq(dds)
# determine results of contrasting STIM vs CTRL samples
res <- results(dds, contrast = c("stim", "STIM", "CTRL"))
# reformat for extracting
res <- as.data.frame(res)
res$gene_name <- rownames(res)
head(res[order(res$padj),])
## baseMean log2FoldChange lfcSE stat pvalue padj
## CXCL11 5693.183 8.507819 0.5538471 15.36131 2.974939e-53 2.525723e-49
## ISG15 70388.943 6.560922 0.5134575 12.77793 2.178006e-37 9.245634e-34
## DEFB1 727.629 6.472731 0.5514765 11.73709 8.226540e-32 2.328111e-28
## IFIT1 7711.616 6.948470 0.6192648 11.22051 3.233826e-29 6.863795e-26
## RSAD2 5659.016 6.736654 0.6207118 10.85311 1.927504e-27 3.272902e-24
## OASL 2273.690 4.452249 0.4147449 10.73491 6.978685e-27 9.874840e-24
## gene_name
## CXCL11 CXCL11
## ISG15 ISG15
## DEFB1 DEFB1
## IFIT1 IFIT1
## RSAD2 RSAD2
## OASL OASL
Unlike with the FindMarkers() function,
DESeq2 gives us a signed ranking value we can use directly
into fGSEA: stat. We can then rank the genes
as we did with the previous analysis and run fGSEA.
# Create ranked list
ranked_genes <- res %>%
filter(!is.na(stat)) %>%
arrange(desc(stat)) %>%
pull(stat, name = gene_name)
# Run GSEA
fgsea_results <- fgsea(
pathways = pathways,
stats = ranked_genes,
minSize = 15,
maxSize = 500
)
# output results
fgsea_results %>%
arrange(padj) %>%
head(5)
## pathway pval padj log2err
## <char> <num> <num> <num>
## 1: HALLMARK_INTERFERON_GAMMA_RESPONSE 1.000000e-50 4.900000e-49 NA
## 2: HALLMARK_INTERFERON_ALPHA_RESPONSE 7.125911e-46 1.745848e-44 1.7624393
## 3: HALLMARK_INFLAMMATORY_RESPONSE 2.661566e-09 4.347225e-08 0.7749390
## 4: HALLMARK_COMPLEMENT 2.267307e-08 2.777451e-07 0.7337620
## 5: HALLMARK_OXIDATIVE_PHOSPHORYLATION 4.991389e-08 4.891561e-07 0.7195128
## ES NES size leadingEdge
## <num> <num> <int> <list>
## 1: 0.8661461 2.918061 191 CXCL11, ....
## 2: 0.9134726 2.886607 93 CXCL11, ....
## 3: 0.5901638 1.975323 169 CXCL11, ....
## 4: 0.5800029 1.936951 163 IRF7, SE....
## 5: -0.4120766 -1.935689 180 NDUFAB1,....
fgsea_results %>%
arrange(padj) %>%
tail(5)
## pathway pval padj log2err ES
## <char> <num> <num> <num> <num>
## 1: HALLMARK_MTORC1_SIGNALING 0.6652720 0.7244073 0.03456643 0.2716777
## 2: HALLMARK_HEDGEHOG_SIGNALING 0.7128571 0.7593478 0.04577045 0.3293457
## 3: HALLMARK_UV_RESPONSE_DN 0.7582781 0.7905453 0.03084682 0.2591534
## 4: HALLMARK_E2F_TARGETS 0.7895833 0.8060330 0.02583317 0.2504684
## 5: HALLMARK_NOTCH_SIGNALING 0.8443580 0.8443580 0.08679498 -0.2333848
## NES size leadingEdge
## <num> <int> <list>
## 1: 0.9096071 183 PSMA4, P....
## 2: 0.8109045 20 PML
## 3: 0.8300572 106 CACNA1A,....
## 4: 0.8381129 179 LMNB1, D....
## 5: -0.7869075 28 PPARD, A....
Here we also showcase both the top genesets and the bottom genesets, in terms of significance.
As before, a table plot is a nice balance between raw metrics and visualization of the ranked set of genes that led to the genesets being labeled as significant.
# Table plot of top pathways
topPathways <- fgsea_results %>%
filter(padj < 0.05) %>%
arrange(desc(abs(NES))) %>%
pull(pathway)
plotGseaTable(
pathways[topPathways],
ranked_genes,
fgsea_results,
gseaParam = 0.5,
pathwayLabelStyle = list(size=8)
)
For this set of enrichment plots, we are showing the contrast between something being significant (first plot) and not significant (second plot). As we can see, unlike the first plot, the second plot never reaches a high enrichment score in either the negative or positive direction. This is further demonstrated by the even distribution of the gene rankins.
# compare top pathway to bottom pathway
p1 <- plotEnrichment(
pathways[["HALLMARK_INFLAMMATORY_RESPONSE"]],
ranked_genes
) + labs(title = "Inflammatory Response")
p2 <- plotEnrichment(
pathways[["HALLMARK_NOTCH_SIGNALING"]],
ranked_genes
) + labs(title = "Notch Signaling")
plot_grid(p1, p2, ncol=1)
Finally, here are the package versions used to create this tutorial.
sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.7.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] cowplot_1.2.0 ggplot2_4.0.0
## [3] dplyr_1.1.4 DESeq2_1.48.2
## [5] SummarizedExperiment_1.38.1 Biobase_2.68.0
## [7] MatrixGenerics_1.20.0 matrixStats_1.5.0
## [9] GenomicRanges_1.60.0 GenomeInfoDb_1.44.3
## [11] IRanges_2.42.0 S4Vectors_0.46.0
## [13] BiocGenerics_0.54.0 generics_0.1.4
## [15] msigdbr_25.1.1 fgsea_1.34.2
## [17] stxBrain.SeuratData_0.1.2 pbmc3k.SeuratData_3.1.4
## [19] ifnb.SeuratData_3.1.0 SeuratData_0.2.2.9002
## [21] Seurat_5.3.0 SeuratObject_5.2.0
## [23] sp_2.2-0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 rstudioapi_0.17.1 jsonlite_2.0.0
## [4] magrittr_2.0.4 spatstat.utils_3.2-0 farver_2.1.2
## [7] rmarkdown_2.30 vctrs_0.6.5 ROCR_1.0-11
## [10] spatstat.explore_3.5-3 S4Arrays_1.8.1 htmltools_0.5.8.1
## [13] curl_7.0.0 SparseArray_1.8.1 sass_0.4.10
## [16] sctransform_0.4.2 parallelly_1.45.1 KernSmooth_2.23-26
## [19] bslib_0.9.0 htmlwidgets_1.6.4 ica_1.0-3
## [22] plyr_1.8.9 plotly_4.11.0 zoo_1.8-14
## [25] cachem_1.1.0 igraph_2.1.4 mime_0.13
## [28] lifecycle_1.0.4 pkgconfig_2.0.3 Matrix_1.7-3
## [31] R6_2.6.1 fastmap_1.2.0 GenomeInfoDbData_1.2.14
## [34] fitdistrplus_1.2-4 future_1.67.0 shiny_1.11.1
## [37] digest_0.6.37 colorspace_2.1-2 patchwork_1.3.2
## [40] tensor_1.5.1 RSpectra_0.16-2 irlba_2.3.5.1
## [43] labeling_0.4.3 progressr_0.16.0 spatstat.sparse_3.1-0
## [46] httr_1.4.7 polyclip_1.10-7 abind_1.4-8
## [49] compiler_4.5.1 withr_3.0.2 S7_0.2.0
## [52] BiocParallel_1.42.2 fastDummies_1.7.5 MASS_7.3-65
## [55] DelayedArray_0.34.1 rappdirs_0.3.3 tools_4.5.1
## [58] lmtest_0.9-40 httpuv_1.6.16 future.apply_1.20.0
## [61] goftest_1.2-3 glue_1.8.0 nlme_3.1-168
## [64] promises_1.3.3 grid_4.5.1 Rtsne_0.17
## [67] cluster_2.1.8.1 reshape2_1.4.4 gtable_0.3.6
## [70] spatstat.data_3.1-8 tidyr_1.3.1 data.table_1.17.8
## [73] XVector_0.48.0 spatstat.geom_3.6-0 RcppAnnoy_0.0.22
## [76] ggrepel_0.9.6 RANN_2.6.2 pillar_1.11.1
## [79] stringr_1.5.2 limma_3.64.3 spam_2.11-1
## [82] babelgene_22.9 RcppHNSW_0.6.0 later_1.4.4
## [85] splines_4.5.1 lattice_0.22-7 survival_3.8-3
## [88] deldir_2.0-4 tidyselect_1.2.1 locfit_1.5-9.12
## [91] miniUI_0.1.2 pbapply_1.7-4 knitr_1.50
## [94] gridExtra_2.3 scattermore_1.2 xfun_0.53
## [97] statmod_1.5.0 UCSC.utils_1.4.0 stringi_1.8.7
## [100] lazyeval_0.2.2 yaml_2.3.10 evaluate_1.0.5
## [103] codetools_0.2-20 tibble_3.3.0 cli_3.6.5
## [106] uwot_0.2.3 xtable_1.8-4 reticulate_1.43.0
## [109] jquerylib_0.1.4 Rcpp_1.1.0 globals_0.18.0
## [112] spatstat.random_3.4-2 png_0.1-8 spatstat.univar_3.1-4
## [115] parallel_4.5.1 assertthat_0.2.1 presto_1.0.0
## [118] dotCall64_1.2 listenv_0.9.1 viridisLite_0.4.2
## [121] scales_1.4.0 ggridges_0.5.7 purrr_1.1.0
## [124] crayon_1.5.3 rlang_1.1.6 fastmatch_1.1-6