In markziemann/mitch: Multi-Contrast Gene Set Enrichment Analysis

Background

Applying enrichment analysis to methylation array data is difficult due to the presence of a variable number of probes per gene and the fact that a probe could belong to overlapping genes. There are existing over-representation based approaches to this, however they appear to lack sensitivity. To address this, we have developed a simple approach to aggregating differential methylation data to make it suitable for downstream use by mitch. The process begins with the differential probe methylation results from limma. Here, we summarise the limma t-statistics by gene using the arithmetic mean. The resulting gene level differential methylation scores then undergo mitch as usual.

Requirements

In addition to mitch v1.15.0 of higher, you will need an annotation set for the array you are using. These are conveniently provided as Bioconductor packages for HM450K and EPIC arrays.

HM450k: IlluminaHumanMethylation450kanno.ilmn12.hg19

EPIC: IlluminaHumanMethylationEPICanno.ilm10b4.hg19

One issue with these is that the annotations are quite old, which means that some of the gene names have changed (~12%), so it is a good idea to screen the list of gene names and update obsolete names using the HGNChelper package.

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("mitch")

suppressPackageStartupMessages({
    library("mitch")
    library("HGNChelper")
    library("IlluminaHumanMethylation450kanno.ilmn12.hg19")
    library("IlluminaHumanMethylationEPICanno.ilm10b4.hg19")
})

Gene sets

In this cut down example we will be using a sample of 200 Reactome gene sets.

data(genesetsExample)
head(genesetsExample,3)

Probe gene relationship for HM450K array data

In order to get mitch working, we need a 2 column table that maps probes to genes. The workflow shown here is for the HM450k array, and an EPIC example is show at the end of the report.

anno <- getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19)
myann <- data.frame(anno[,c("UCSC_RefGene_Name","UCSC_RefGene_Group","Islands_Name","Relation_to_Island")])
head(myann)
gp <- myann[,"UCSC_RefGene_Name",drop=FALSE]
gp2 <- strsplit(gp$UCSC_RefGene_Name,";")
names(gp2) <- rownames(gp)
gp2 <- lapply(gp2,unique)
gt1 <- stack(gp2)
colnames(gt1) <- c("gene","probe")
gt1$probe <- as.character(gt1$probe)
dim(gt1)
head(gt1)

Update deprecated gene symbols

Update old gene symbols using HGNChelper (13% of 21k names). Last updated 4th Sep 2024.

#new.hgnc.table <- getCurrentHumanMap()
new.hgnc.table <- readRDS("../inst/extdata/new.hgnc.table.rds")
fix <- checkGeneSymbols(gt1$gene,map=new.hgnc.table)
fix2 <- fix[which(fix$x != fix$Suggested.Symbol),]
length(unique(fix2$x))
gt1$gene <- fix$Suggested.Symbol
head(gt1)

Importing profiling data

Here we will read in a table of differential probe methylation data generated by limma. We will use the t-statistics for downstream analysis.

x <- read.table("https://ziemann-lab.net/public/gmea/dma3a.tsv",header=TRUE,row.names=1)
head(x)

Now that the profiling data is loaded, we need to import with mitch package, which establishes the probe-gene relationships and aggregates the data to gene level scores. As you can see, the input was a table of 422k probes and the output is 19,380 gene scores. Many probes not annotated to genes are discarded.

y <- mitch_import(x,DEtype="limma",geneTable=gt1)
head(y)
dim(y)

Calculating enrichment

The mitch_calc function performs an enrichment test. If you imported multiple data tables in the previous step, mitch will conduct a multivariate enrichment test. The results can be prioritised by significance or effect size. My recommendation is to discard results with FDR>0.05 then prioritise by effect size, which for us is the mitch enrichment score called S distance. In this example I also set the minimum gene set size to 5.

res<-mitch_calc(y,genesetsExample,priority="effect",cores=2,minsetsize=5)
head(res$enrichment_result,10)

Downstream presentation

For presentation of the results you could consider making a volcano plot and/or making a barplot of S.dist values for selected gene sets that meet the FDR cutoff. You can also use the built in functions to make a set of charts and html report. It is vitally important to inspect the detailed resports to understand which genes are driving the enrichment in your dataset.

mitch_plots(res,outfile="methcharts.pdf")
mitch_report(res,"methreport.html")

Probe gene relationship for EPIC array data

Here is the code to create a probe-gene table for EPIC array data. Be sure to update the gene symbols with HGNChelper before conducting enrichment analysis.

anno <- getAnnotation(IlluminaHumanMethylationEPICanno.ilm10b4.hg19)
myann <- data.frame(anno[,c("UCSC_RefGene_Name","UCSC_RefGene_Group","Islands_Name","Relation_to_Island")])
gp <- myann[,"UCSC_RefGene_Name",drop=FALSE]
gp2 <- strsplit(gp$UCSC_RefGene_Name,";")
names(gp2) <- rownames(gp)
gp2 <- lapply(gp2,unique)
gt <- stack(gp2)
colnames(gt) <- c("gene","probe")
gt$probe <- as.character(gt$probe)
dim(gt)
str(gt)