The input of
mCSEA is tipically a matrix with the processed $\beta$-values for each probe
and sample. If you start from the raw methylation files (like .idat files) you
should first preprocess the data with any of the available packages for
that purpose (e.g. r BiocStyle::Biocpkg("minfi")
or
r BiocStyle::Biocpkg("ChAMP")
). Minfi
includes functions to get a matrix of $\beta$-values (getBeta()) or
M-values (getM()). ChAMP output class depends on the type of analysis
performed. For instance, champ.norm() function returns a matrix, while
champ.load() returns a list of results, and one of them is a $\beta$-values
matrix. So mCSEA is totally compatible with minfi and ChAMP outputs as long as
a matrix with the methylation values is obtained.
Here we provide a minimal example to read .idat files with
r BiocStyle::Biocpkg("minfi")
package. A samplesheet must
be provided in order to read the raw files and generate a
RGChannelSet object. For more information about this step,
read minfi's vignette.
library(minfi) minfiDataDir <- system.file("extdata", package = "minfiData") targets <- read.metharray.sheet(minfiDataDir, verbose = FALSE) RGset <- read.metharray.exp(targets = targets)
Different cell types proportions across samples are one of the major sources of variability in methylation data from tissues like blood or saliva (@mcgregor16). There are a lot of packages which can be used to estimate cell types proportions in each sample in order to correct for this bias (reviewed in @mcgregor16). Here we supply an example where blood reference data is used to estimate cell types proportions of each blood sample.
library(FlowSorted.Blood.450k) library(mCSEA) data(mcseadata) cellCounts = estimateCellCounts(RGset) print(cellCounts)
These proportions could be introduced as covariates in the linear models.
To run a mCSEA analysis, you must rank all the evaluated CpGs probes with some metric (e.g. t-statistic, Fold-Change...). You can use rankProbes() function for that aim, or prepare a ranked list with the same structure as the rankProbes() output.
We load sample data to show how rankProbes() works:
library(mCSEA) data(mcseadata)
We loaded to our R environment betaTest and phenoTest objects, in addition to exprTest, annotation objects and association objects (we will talk about these after). betaTest is a matrix with the $\beta$-values of 10000 EPIC probes for 20 samples. phenoTest is a dataframe with the explanatory variable and covariates associated to the samples. When you load your own data, the structure of your objects should be similar.
head(betaTest, 3) print(phenoTest)
rankProbes() function uses these two objects as input and apply a linear
model with r BiocStyle::Biocpkg("limma")
package. By default, rankProbes()
considers the first column of the phenotypes table as the explanatory variable
in the model (e.g. cases and controls) and does not take into account any
covariate to adjust the models. You can change this behaviour modifying
explanatory and covariates options.
By default, rankProbes() assumes that the methylation data object contains $\beta$-values and transform them to M-values before calculating the linear models. If your methylation data object contains M-values, you must specify it (typeInput = "M"). You can also use $\beta$-values for models calculation (typeAnalysis = "beta"), although we do not recommend it due to it has been proven that M-values better accomplish the statistical assumptions of limma analysis (@du10).
myRank <- rankProbes(betaTest, phenoTest, refGroup = "Control")
myRank is a named vector with the t-values for each CpG probe.
head(myRank)
You can also supply rankProbes() function with a SummarizedExperiment object. In that case, if you don't specify a pheno object, phenotypes will be extracted from the SummarizedExperiment object with colData() function.
It is possible to take into account paired samples in rankProbes() analysis. For that aim, you should use paired = TRUE parameter and specify the column in pheno containing pairing information (pairColumn parameter).
Once you calculated a score for each CpG, you can perform the mCSEA analysis. For that purpose, you should use mCSEATest() function. This function takes as input the vector generated in the previous step, the methylation data and the phenotype information. By default, it searches for differentially methylated promoters, gene bodies and CpG Islands. You can specify the regions you want to test with regionsTypes option. minCpGs option specifies the minimum amount of CpGs in a region to be considered in the analysis (5 by default). You can increase the number of processors to use with nproc option (recommended if you have enough computational resources). Finally, you should specify if the array platform is 450k or EPIC with the platform option.
myResults <- mCSEATest(myRank, betaTest, phenoTest, regionsTypes = "promoters", platform = "EPIC")
mCSEATest() returns a list with the GSEA results and the association objects for each region type analyzed, in addition to the input data (methylation, phenotype and platform).
ls(myResults)
promoters is a data frame with the following columns (partially extracted
from r BiocStyle::Biocpkg("fgsea")
help):
head(myResults[["promoters"]][,-7])
On the other hand, promoters_association is a list with the CpG probes associated to each feature:
head(myResults[["promoters_association"]], 3)
You can also provide a custom association object between CpG probes and regions (customAnnotation option). This object should be a list with a structure similar to this:
head(assocGenes450k, 3)
Once you found some DMRs, you can make a plot with the genomic context of the interesting ones. For that, you must provide mCSEAPlot() function with the mCSEATest() results, and you must specify which type of region you want to plot and the name of the DMR to be plotted (e.g. gene name). There are some graphical parameters you can adjust (see mCSEAPlot() help). Take into account that this function connects to some online servers in order to get genomic information. For that reason, this function could take some minutes to finish the plot, specially the first time it is executed.
mCSEAPlot(myResults, regionType = "promoters", dmrName = "CLIC6", transcriptAnnotation = "symbol", makePDF = FALSE)
You can also plot the GSEA results for a DMR with mCSEAPlotGSEA() function.
mCSEAPlotGSEA(myRank, myResults, regionType = "promoters", dmrName = "CLIC6")
If you have both methylation and expression data for the same samples, you can integrate them in order to discover significant associations between methylation changes in a DMR and an expression alterations in a close gene. mCSEAIntegrate() considers the DMRs identified by mCSEATest() passing a P-value threshold (0.05 by default). It calculates the mean methylation for each condition using the leading edge CpGs and performs a correlation test between this mean DMR methylation and the expression of close genes. This function automatically finds the genes located within a determined distance (1.5 kb) from the DMR. Only correlations passing thresholds (0.5 for correlation value and 0.05 por P-value by default) are returned. For promoters, only negative correlations are returned due to this kind of relationship between promoters methylation and gene expression has been largely observed (@jones12). On the contrary, only positive correlations between gene bodies methylation and gene expression are returned, due to this is a common relationship observed (@jones12). For CpG islands and custom regions, both positive and negative correlations are returned, due to they can be located in both promoters and gene bodies.
To test this function, we extracted a subset of 100 genes expression from bone
marrows of 10 healthy and 10 leukemia patients (exprTest). Data was
extracted from r BiocStyle::Biocpkg("leukemiasEset")
package.
# Explore expression data head(exprTest, 3) # Run mCSEAIntegrate function resultsInt <- mCSEAIntegrate(myResults, exprTest, "promoters", "ENSEMBL") resultsInt
It is very important to specify the correct gene identifiers used in the expression data (geneIDs parameter). mCSEAIntegrate() automatically generates correlation plots for the significant results and save them in the directory specified by folder parameter (current directory by default).
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.