knitr::opts_chunk$set(echo = TRUE)
coMethDMR
is an R package that identifies genomic regions that are both co-methylated and differentially
methylated in Illumina array datasets. Instead of testing all CpGs within a genomic region, coMethDMR
carries out an additional step that selects co-methylated sub-regions first without using any outcome information. Next, coMethDMR
tests association between methylation within the sub-region and continuous phenotype using a random coefficient mixed effects model, which models both variations between CpG sites within the region and differential methylation simultaneously.
The latest version can be installed by
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("coMethDMR")
After installation, the coMethDMR
package can be loaded into R using:
library(coMethDMR)
If you are running the coMethDMR
package for the first time, you will also need to download supplemental data sets from the sesameData
package. This happens automatically for the 450k and EPIC arrays when you load coMethDMR
for the first time. If you experience errors in this process, you may have your cache in an older default location. Please follow these steps to update where these genomic data are stored: https://bioconductor.org/packages/devel/bioc/vignettes/ExperimentHub/inst/doc/ExperimentHub.html#default-caching-location-update.
The input of coMethDMR
are methylation beta values. We assume quality control and normalization of the methylation dataset have been performed, by R packages such as minfi
or RnBeads
. For illustration, we use a subset of prefrontal cortex methylation data (GEO GSE59685) from a recent Alzheimer's disease epigenome-wide association study which was described in Lunnon et al. (2014). This example dataset contains beta values for 8552 CpGs on chromosome 22 for a random selection of 20 subjects.
data(betasChr22_df) betasChr22_df[1:5, 1:5]
If you have used either the minfi or RnBeads packages to pre-process your methylation data, we now show a quick example of the code necessary to create a methylation data frame similar to betasChr22_df
above (note that dataObject_minfi
and dataObject_RnBeads
are objects containing the pre-processed DNA methylation data as returned by the minfi::
or RnBeads::
packages, respectively):
### minfi ### betas_df <- as.data.frame( minfi::getMethSignal(dataObject_minfi, what = "Beta") ) ### RnBeads ### betas_df <- as.data.frame( RnBeads::meth(dataObject_RnBeads, row.names = TRUE) )
The corresponding phenotype dataset included variables stage
(Braak AD stage), subject.id
, slide
(batch effect), Sex
, Sample
and age.brain
(age of the brain donor). Please note the phenotype file needs to have a variable called "Sample" that will be used by coMethDMR to link to the methylation dataset.
data(pheno_df) head(pheno_df)
For illustration, suppose we are interested in identifying co-methylated genomic regions associated with AD stages (stage
treated as a linear variable). Here we demonstrate analysis of genomic regions mapped to CpG islands on chromosome 22. However the workflow can be similarly conducted for other types of genomic regions. See details in Section 2.1 below for gene based pipeline that tests genic and intergenic regions.
There are several steps: (1) obtain CpGs located closely (see details in Section 2.1 below) in genomic regions mapped to CpG islands, (2) identify co-methylated regions, and (3) test co-methylated regions against the outcome variable AD stage.
For the first step, we use the following commands:
CpGisland_ls <- readRDS( system.file( "extdata", "CpGislandsChr22_ex.rds", package = 'coMethDMR', mustWork = TRUE ) )
Here, CpGisland_ls
is a list of 20 items, with each item of the list including a group of CpG probe IDs located closely within a particular CpG island region. Section 2.1 discusses how to import additional types of genomic regions.
Next, we identify co-methylated regions based on Mvalues.
system.time( coMeth_ls <- CoMethAllRegions( dnam = betasChr22_df, betaToM = TRUE, # converts beta to m-values method = "pearson", CpGs_ls = CpGisland_ls, arrayType = "450k", returnAllCpGs = FALSE, output = "CpGs", nCores_int = 1 ) ) # ~15 seconds coMeth_ls
coMeth_ls
is list with that contains groups of CpG probeIDs corresponding to co-methylated regions. Three comethylated regions were identified in this example.
If we want to look at co-methylation within the first co-methylated region:
WriteCorrPlot <- function(beta_mat){ require(corrplot) require(coMethDMR) CpGs_char <- row.names(beta_mat) CpGsOrd_df <- OrderCpGsByLocation( CpGs_char, arrayType = c("450k"), output = "dataframe" ) betaOrdered_mat <- t(beta_mat[CpGsOrd_df$cpg ,]) corr <- cor( betaOrdered_mat, method = "spearman", use = "pairwise.complete.obs" ) corrplot(corr, method = "number", number.cex = 1, tl.cex = 0.7) } # subsetting beta values to include only co-methylated probes betas_df <- subset( betasChr22_df, row.names(betasChr22_df) %in% coMeth_ls[[1]] ) WriteCorrPlot(betas_df)
Next, we test these co-methylated regions against stage
using a random coefficient model (more details in section 2.3 below).
Some messages are generated during mixed models fitting, which are saved to the file specified by outLogFile
. The interpretations of these messages can be found in the FAQs at the end of this document (see Section 3, item (1) and (2)).
out_df <- lmmTestAllRegions( betas = betasChr22_df, region_ls = coMeth_ls, pheno_df, contPheno_char = "stage", covariates_char = NULL, modelType = "randCoef", arrayType = "450k" # generates a log file in the current directory # outLogFile = paste0("lmmLog_", Sys.Date(), ".txt") ) out_df
Here out_df
is a data frame of genomic regions, with corresponding p-values and false discovery rate (FDRs) from the random coefficient mixed model.
We can annotate these results by adding corresponding genes and probes mapped to the genomic regions.
system.time( outAnno_df <- AnnotateResults( lmmRes_df = out_df, arrayType = "450k" ) ) # ~12 seconds outAnno_df
To further examine the significant regions, we can also extract individual CpG p-values within these significant regions. For example, for the most significant region chr22:18268062-18268249
,
outCpGs_df <- CpGsInfoOneRegion( regionName_char = "chr22:18268062-18268249", betas_df = betasChr22_df, pheno_df = pheno_df, contPheno_char = "stage", covariates_char = NULL, arrayType = "450k" ) outCpGs_df
These CpGs mapped to intergenic regions, so there are no gene names associated with the probes. For genic regions such as chr22:19709548-19709755
, we would have results such as the following:
# library("GenoGAM") # NOTE 2022-03-28: what are we using this package for again?
CpGsInfoOneRegion( regionName_char = "chr22:19709548-19709755", betas_df = betasChr22_df, pheno_df = pheno_df, contPheno_char = "stage", covariates_char = NULL, arrayType = "450k" )
coMethDMR
workflowGenomic regions on the Illumina arrays can be defined based on their relations to genes or CpG Islands. To reduce redundancy in the tested genomic regions, we recommend first testing genic and intergenic regions, then add annotations to each genomic region for their relation to CpG islands.
In coMethDMR
package, for 450k arrays, the relevant genomic regions to be analyzed are in files 450k_Gene_3_200.RDS
and 450k_InterGene_3_200.rds
. For EPIC arrays, the relevant genomic regions are in files EPIC_Gene_3_200.rds
and EPIC_InterGene_3_200.rds
. These additional data sets are available at https://github.com/TransBioInfoLab/coMethDMR_data.
These files were created using the function WriteCloseByAllRegions
, briefly, for genic regions, within each gene, we identified clusters of CpGs located closely (i.e. the maximum separation between any two consecutive probes is 200bp; maxGap = 200
), and we required each cluster to have at least 3 CpGs (minCpGs = 3
). For intergenic regions, we identified clusters CpGs similarly for each chromosome. To extract clusters of close-by CpGs from pre-defined genomic regions with different values of maxGap
and minCpGs
, the WriteCloseByAllRegions
function can be used.
The pre-computed genomic regions can be accessed using the following commands. For geneic regions in 450k arrays,
gene_ls <- readRDS( system.file( "extdata", "450k_Gene_3_200.rds", package = 'coMethDMR', mustWork = TRUE ) )
Here gene_ls
is a list, with each item containing a character vector of CpGs IDs for a particular region in a gene.
Vignette # 2 illustrates how to leverage parallel computing via BiocParallel
R package to make gene-based analysis fast.
Before identifying co-methylated clusters, we recommend removing uninteresting technical and biological effects, so that the resulting co-methylated clusters are only driven by the biological factors we are interested in. This can be accomplished using the GetResiduals
function.
For example, the following script computes residuals from linear model
Mvalues ~ age.brain + sex + slide
(note that we use only a subset of the Chromosome 22 CpG islands for computing time consideration).
Cgi_ls <- readRDS( system.file( "extdata", "CpGislandsChr22_ex.rds", package = 'coMethDMR', mustWork = TRUE ) ) betasChr22_df <- betasChr22_df[rownames(betasChr22_df) %in% unlist(Cgi_ls)[1:20], ]
resid_mat <- GetResiduals( dnam = betasChr22_df, # converts to Mvalues for fitting linear model betaToM = TRUE, pheno_df = pheno_df, covariates_char = c("age.brain", "sex", "slide") )
Within each genomic region, coMethDMR identifies contiguous and co-methylated CpGs sub-regions without using any outcome information. To select these co-methylated sub-regions, we use the rdrop
statistic, which is the correlation between each CpG with the sum of methylation levels in all other CpGs. The default is rDropThresh_num = 0.4
. We recommend this setting based on our simulation study. Note that higher rDropThresh_num
values lead to fewer co-methylated regions.
Again, for illustration, we use CpG islands. For example, if we are interested in identifying co-methylated sub-region within the first genomic region in Cgi_ls
:
Cgi_ls <- readRDS( system.file( "extdata", "CpGislandsChr22_ex.rds", package = 'coMethDMR', mustWork = TRUE ) ) coMeth_ls <- CoMethAllRegions( dnam = resid_mat, betaToM = FALSE, method = "pearson", CpGs_ls = Cgi_ls[1], arrayType = "450k", returnAllCpGs = FALSE, output = "CpGs" ) coMeth_ls
The results (NULL
) indicate there is no co-methylated sub-region within the first genomic region.
What about the other 19 regions? Next we look at a region (5th region in Cgi_ls
) where there is a co-methylated sub-region:
coMeth_ls <- CoMethAllRegions( dnam = resid_mat, betaToM = FALSE, CpGs_ls = Cgi_ls[5], arrayType = "450k", returnAllCpGs = FALSE, output = "CpGs" ) coMeth_ls
coMeth_ls
is a list, where each item is a list of CpG probe IDs for a co-methylated sub-region.
If we want to see the detailed output of the coMethDMR algorithm, that is, how the co-methylated region was obtained, we can specify output = "dataframe"
:
coMethData_df <- CoMethAllRegions( dnam = resid_mat, betaToM = FALSE, CpGs_ls = Cgi_ls[5], arrayType = "450k", returnAllCpGs = FALSE, output = "dataframe" ) [[1]] coMethData_df
coMethData_df
provides the details on how the co-methylated region was obtained: Here keep = 1
if rDropThresh_num > 0.4
(i.e. a co-methylated CpG), and keep_contigous
indicates if the probe is in a contiguous co-methylated region. Note that only the last 3 CpGs constitutes the co-methylated cluster.
To test association between a continuous phenotype and methylation values in a contiguous co-methylated region, two mixed models have been implemented in the function lmmTestAllRegions
: a random coefficient mixed model (modelType = "randCoef"
) and a simple linear mixed model (modelType = "simple"
).
The random coefficient mixed model includes both a systematic component that models the mean for each group of CpGs, and a random component that models how each CpG varies with respect to the group mean (random probe effects). It also includes random sample effects that model correlations between multiple probes within the same sample.
More specifically, the random coefficient model is
methylation M value ~ contPheno_char + covariates_char + (1|Sample) + (contPheno_char|CpG).
The last term (contPheno_char|CpG)
specifies both random intercepts and slopes for each CpG.
The simple linear mixed model includes all the terms in the random coefficient model except random probe effects.
The simple linear mixed model is
methylation M value ~ contPheno_char + covariates_char + (1|Sample)
To test one genomic region against the continuous phenotype stage
, adjusting for age.brain
:
lmmTestAllRegions( betas = betasChr22_df, region_ls = coMeth_ls[1], pheno_df, contPheno_char = "stage", covariates_char = "age.brain", modelType = "randCoef", arrayType = "450k" )
If we don't want to adjust for any covariate effect, we can set covariates_char
to NULL
.
Finally, we demonstrate coMethDMR
analysis for a particular gene, for example the ARFGAP3
gene.
data(betasChr22_df)
We assume that the user knows the set of probes corresponding to the gene of interest. If this is not the case, we provide two data sets which contain mappings from gene symbols to probe IDs for both 450k ("450k_CpGstoGene_min3CpGs.rds") and EPIC ("EPIC_CpGstoGene_min3CpGs.rds") arrays. These two data sets are available at: https://github.com/TransBioInfoLab/coMethDMR_data/tree/main/data.
# list probes for this gene ARFGAP3_CpGs_char <- c( "cg00079563", "cg01029450", "cg02351223", "cg04527868", "cg09861871", "cg26529516", "cg00539564", "cg05288033", "cg09367092", "cg10648908", "cg14570855", "cg15656623", "cg23778094", "cg27120833" ) # list probes located closely on this gene gene3_200 <- CloseBySingleRegion( CpGs_char = ARFGAP3_CpGs_char, arrayType = "450k", maxGap = 200, minCpGs = 3 ) CpGsOrdered_ls <- lapply( gene3_200, OrderCpGsByLocation, arrayType = "450k", output = "dataframe" ) names(gene3_200) <- lapply(CpGsOrdered_ls, NameRegion) # List of regions gene3_200
Now that we have a list of regions, we can find the co-methylated regions within the gene and test their statistical significance:
# co-methlyated region within the gene coMeth_ls <- CoMethAllRegions( dnam = betasChr22_df, betaToM = TRUE, method = "pearson", CpGs_ls = gene3_200, arrayType = "450k", returnAllCpGs = FALSE, output = "CpGs" ) coMeth_ls # test the co-methylated regions within the gene results <- lmmTestAllRegions( betas = betasChr22_df, region_ls = coMeth_ls, pheno_df, contPheno_char = "stage", covariates_char = "age.brain", modelType = "randCoef", arrayType = "450k" # generates a log file in the current directory # outLogFile = paste0("lmmLog_", Sys.Date(), ".txt") )
Before we inspect the results, we will annotate them:
AnnotateResults(lmmRes_df = results, arrayType = "450k")
(1) What happens when mixed model fails to coverge (i.e. the warning "Model failed to converge with..." is resulted for a particular genomic region)? - In this case, the p-value for mixed model is set to 1. In our experiences with methylation datasets, genomic regions with strong signals typically converge. Convergence issues typically occurs when the amount of noise in data is high.
(2) When fitting mixed models with lmmTestAllRegions
function, What does the message "boundary (singular) fit" mean?
- When mixed model is singular, at least one of the estimated variance components for intercepts or slopes random effects is 0, because there isn't enough variabilities in data to estimate the random effects. In this case, mixed model reduces to a fixed effects model. However, as our simulation studies have shown, the p-values obtained for these regions are still valid.
Lunnon K, Smith R, Hannon E, De Jager PL, Srivastava G, Volta M, Troakes C, Al-Sarraj S, Burrage J, Macdonald R, et al (2014) Methylomic profiling implicates cortical deregulation of ANK1 in Alzheimer's disease. Nat Neurosci 17:1164-1170.
utils::sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.