QSEA stands for "Quantitative Sequencing Enrichment Analysis" and implements a statistical framework for modeling and transformation of MeDIP-seq enrichment data to absolute methylation levels similar to BS-sequencing read-outs. Furthermore, QSEA comprises functionality for data normalization that accounts for the effect of CNVs on the read-counts as well as for the detection and annotation of differently methylated regions (DMRs). The transformation is based on a Bayesian model, that explicitly takes background reads and CpG density dependent enrichment profiles of the experiments into account.
To install the QSEA package in your R environment, start R and enter:
BiocManager::install("qsea")
Next, it is necessary to have the genome of interest
available in your R environment.
As soon as you have the r Biocpkg("BSgenome")
package installed and the
library loaded using
BiocManager::install("BSgenome") library("BSgenome")
you can list all available genomes by typing
available.genomes()
In case a genome of interest is not available as a BSgenome
package but the sequence of the genome is available,
a custom BSgenome package can be generated, please see the
"How to forge a BSgenome data package" manual of the r Biocpkg("BSgenome")
package.
In the given example, we mapped the short reads against the human genome build hg19. Therefore, we download and install this genome build:
BiocManager::install("BSgenome.Hsapiens.UCSC.hg19")
This takes some time, but has to be done only once for each reference genome.
In order to specify genomic regions of interest, QSEA utilitzes the GenomicRanges package
BiocManager::install("GenomicRanges")
Finally, the QSEA work flow described below requires access to example data
available in the r Biocpkg("MEDIPSData")
package which can be installed by typing:
BiocManager::install("MEDIPSData")
Here, we show the most important steps of the QSEA work-flow for the analysis of MeDIP seq data. We assume that the reads have been aligned to reference genome hg19 and are on hand in .bam format.
In order to describe the samples, we must provide a data.frame object with at least 3 columns:
Further columns describing the samples are optional, and can be considered in downstream analysis:
To demonstrate the QSEA analysis steps,
we use data available in the r Biocpkg("MEDIPSData")
package.
This package contains aligned MeDIP seq data for chromosomes 20, 21 and 22 of 3 NSCLC
tumor samples and adjacent normal tissue.
data(samplesNSCLC, package="MEDIPSData") knitr::kable(samples_NSCLC) path=system.file("extdata", package="MEDIPSData") samples_NSCLC$file_name=paste0(path,"/",samples_NSCLC$file_name )
Now, the QSEA package has to be loaded.
library(qsea)
To access information of the reference genome, we load the pre-installed (see chapter Installation) hg19 library:
library(BSgenome.Hsapiens.UCSC.hg19)
All relevant information of the enrichment experiment, including sample information, the genomic read coverage, CpG density and other parameters are stored in a "qseaSet" object. To create such an object, call the function "createQseaSet", that takes the following parameters:
qseaSet=createQseaSet(sampleTable=samples_NSCLC, BSgenome="BSgenome.Hsapiens.UCSC.hg19", chr.select=paste0("chr", 20:22), window_size=500) qseaSet
We now read the alignment files and compute the MeDIP coverage for each window. For this step, we have to specify the following parameters:
qseaSet=addCoverage(qseaSet, uniquePos=TRUE, paired=TRUE)
The QSEA model can consider Copy Number Variation (CNV), and account for their
influence on MeDIP read density.
CNV can either be computed externally and imported into QSEA,
or estimated directly within QSEA.
QSEA internally uses the bioconductor package r Biocpkg("HMMcopy")
to estimate CNV from sequencing data of the input library,
or from the MeDIP library (parameter "MeDIP"=TRUE).
In the latter case, only fragments that do not contain CpG dinucleotides are considered.
Externally computed CNVs can be imported with the addCNV function by specifying the "cnv" parameter:
If no "cnv" object is provided, the following parameters control the internal CNV analysis:
qseaSet=addCNV(qseaSet, file_name="file_name",window_size=2e6, paired=TRUE, parallel=FALSE, MeDIP=TRUE)
Note, that CNV estimation will benefit from analysing the whole genome. For this step, limiting the estimation to single chromosomes is not recommendable in general.
QSEA accounts for differences in sequencing depth and library composition by estimating the effective library size for each sample using the trimmed mean of m-values approach [TMM, @robinson2010]. Alternatively, QSEA can import user provided normalization factors, using the "factors" parameter in addLibraryFactors.
qseaSet=addLibraryFactors(qseaSet)
As MeDIP seq read coverage is dependent on the CpG density of the fragment, we estimate the average CpG density per fragment for each genomic window.
qseaSet=addPatternDensity(qseaSet, "CG", name="CpG")
From the regions without CpGs we can estimate the coverage offset from background reads.
#load prepared dataset data(NSCLC_dataset, package="MEDIPSData")
qseaSet=addOffset(qseaSet, enrichmentPattern="CpG")
In order to estimate the relative enrichment, we need to model the enrichment efficiency, which is dependent on the CpG density. The parameters for this model can be estimated using the addEnrichmentParameters() function. For this task we need to provide known values for a subset of windows, for example gained by validation experiments or from other studies. QSEA then fits a sigmoidal function to these values, in order to smooth and extrapolate the provided values. The enrichment parameters, together with the estimated offset of background reads are applied in the transformation to absolute methylation values.
For the example dataset, we can make use of methylation data from the
TCGA lung adenocarcinoma [LUAD,@TCGA_LUAD ] and squamous cell carcinoma
[LUSC,@TCGA_LUSC] studies. For this purpose, DNA methylation datasets
have been downloaded from https://gdc.cancer.gov/, and averaged in 500 base
windows, and filtered for highly methylated windows across all samples.
The relevant chromosomes of this data is contained in the "tcga_lung" object
of the r Biocpkg("MEDIPSData")
package.
data(tcga_luad_lusc_450kmeth, package="MEDIPSData") wd=findOverlaps(tcga_luad_lusc_450kmeth, getRegions(qseaSet), select="first") signal=as.matrix(mcols(tcga_luad_lusc_450kmeth)[,rep(1:2,3)]) qseaSet=addEnrichmentParameters(qseaSet, enrichmentPattern="CpG", windowIdx=wd, signal=signal)
In case such information is not available for the analyzed data set, the enrichment model can be calibrated using rough estimates ("blind calibration"). For instance, in adult tissue samples, we can assume high methylation at regions with low CpG density, and linearly decreasing average methylation levels for higher CpG density. However, for different types of samples, such as cell lines or embryonic cells, these assumptions would be a poor fit and lead to false results.
wd=which(getRegions(qseaSet)$CpG_density>1 & getRegions(qseaSet)$CpG_density<15) signal=(15-getRegions(qseaSet)$CpG_density[wd])*.55/15+.25 qseaSet_blind=addEnrichmentParameters(qseaSet, enrichmentPattern="CpG", windowIdx=wd, signal=signal)
In order to review the quality of the MeDIP enrichment and the model assumptions, made in the previous steps, we can check the model parameters estimated by QSEA, which describe the signal to noise ratio, and the enrichment efficiency. At first, we check the estimated fraction of background reads:
getOffset(qseaSet, scale="fraction")
This reveals, that between 9% and 13% of the fragments are due to the background rather than enrichment. High values (e.g. > 0.9) would indicate a lack of enrichment efficiency.
To examine the the sample-wise CpG density dependent enrichment profiles, as estimated by addEnrichmentParameters(), QSEA provides the plotEPmatrix() function:
plotEPmatrix(qseaSet)
The average enrichment, observed in the provided "signal", is depicted in black, and the fitted sigmoidal function in green. Samples with flat profiles might indicate low enrichment efficiency, or poor agreement with the calibration data.
In order to analyze the main characteristics of the data, and to visualize the relationships between the samples, QSEA provides a set of tools for exploratory data analysis.
The plotCNV() function provides a genome wide overview on CNV. Deletations are depicted in green shades, while amplifications are green, and normal copy numbers are blue. Each sample is represented as a row, which is ordered based on the similarity of CNV profiles.
plotCNV(qseaSet)
This plot shows, that one half of chr 20 is amplified in sample 2T, while a large part of chr21 is deleted in sample 1T. The other samples, including 3T, have normal copy numbers. Regions without information (e.g. that are masked in the reference) are represented in gray.
To explore the relationship between samples, a principal component plot can be intuitive. By default, principal component analysis (PCA) is computed based on log transformed normalized count values. In order to use the transformation to absolute methylation values, the "norm_method" parameter is set to "beta". To get an impression of the effects on different levels of annotation, the PCA can be restricted on regions of interest, provided as a GRanges object. To demonstrate this feature, we import an annotation object, containing different UCSC annotation tracks. To obtain CpG Island promoter regions, we intersect CpG Islands with transcription start sites(TSS).
data(annotation, package="MEDIPSData") CGIprom=intersect(ROIs[["CpG Island"]], ROIs[["TSS"]],ignore.strand=TRUE) pca_cgi=getPCA(qseaSet, norm_method="beta", ROIs=CGIprom) col=rep(c("red", "green"), 3) plotPCA(pca_cgi, bg=col, main="PCA plot based on CpG Island Promoters")
In this plot, we can that DNA methylation is different between Tumor and Normal samples, and that the Tumor samples are a heterogeneous group, while the Normal samples cluster tightly together.
Differential Methylation Analysis in QSEA is based on generalized linear models
(GLMs), and a likelihood ratio test, similar to tests performed to detect
deferentially expressed genes, implemented in r Biocpkg("DESeq2")
or
r Biocpkg("edgeR")
.
Briefly, we model read counts with a negative binomial distribution with mean
and dispersion parameter. For each window, we fit a GLM with logarithmic link
function.
For significance testing, we fit a reduced model, and compare the
likelihood ratio (LR) of the models to a Chi-squared distribution.
These steps are implemented in the following functions:
design=model.matrix(~group, getSampleTable(qseaSet) ) qseaGLM=fitNBglm(qseaSet, design, norm_method="beta") qseaGLM=addContrast(qseaSet,qseaGLM, coef=2, name="TvN" )
At fist, the model matrix ("design") for the full model is defined, using the group column of the sample table. The model contains two coeficients: an intercept, and a dummy variable, distinguishing between "Normal" and "Tumor". "fitNBglm" simultaneously fits the dispersion parameters and the model coefficients for the full model. In the next step, "addContrast" fits a model, reduced by the 2nd component (thus, only intercept remains). In addition, the likelihood ratio test is performed. All results are stored in a QSEA GLM object. Note, that one GLM object can contain several contrasts and test results (for more complex experimental designs).
We can now create a results table with raw counts (_reads) estimated % methylation values (_beta), as well as an 95% credibility interval [_betaLB, _betaUB] for the estimates for all deferentially methylated windows. "isSignificant" selects windows below the defined significance threshold.
library(GenomicRanges) sig=isSignificant(qseaGLM, fdr_th=.01) result=makeTable(qseaSet, glm=qseaGLM, groupMeans=getSampleGroups(qseaSet), keep=sig, annotation=ROIs, norm_method="beta") knitr::kable(head(result))
The makeTable function is implemented quite generically, and provides the following options, to specify the regions and the content of the table:
To assess the enrichment of differentially methylated regions within genomic annotations, the regionsStats() functions
sigList=list(gain=isSignificant(qseaGLM, fdr_th=.1,direction="gain"), loss=isSignificant(qseaGLM, fdr_th=.1,direction="loss")) roi_stats=regionStats(qseaSet, subsets=sigList, ROIs=ROIs) knitr::kable(roi_stats) roi_stats_rel=roi_stats[,-1]/roi_stats[,1] x=barplot(t(roi_stats_rel)*100,ylab="fraction of ROIs[%]", names.arg=rep("", length(ROIs)+1), beside=TRUE, legend=TRUE, las=2, args.legend=list(x="topleft"), main="Feature enrichment Tumor vs Normal DNA methylation") text(x=x[2,],y=-.15,labels=rownames(roi_stats_rel), xpd=TRUE, srt=30, cex=1, adj=c(1,0))
As expected for tumor samples, hypomethylated regions are more frequent genome wide, while CpG islands are enriched for hypermethylation.
If we are interested in a particular genomic region, it can be depicted in a genome browser like manner as follows: (find a more interesting example, add annotations ...)
plotCoverage(qseaSet, samples=getSampleNames(qseaSet), chr="chr20", start=38076001, end=38090000, norm_method="beta", col=rep(c("red", "green"), 3), yoffset=1,space=.1, reorder="clust", regions=ROIs["TFBS"],regions_offset=.5, cex=.7 )
A large part of the run time is required for processing the alignment files.
These steps can be parallelized using the r Biocpkg("BiocParallel")
package:
library("BiocParallel") register(MulticoreParam(workers=3)) qseaSet=addCoverage(qseaSet, uniquePos=TRUE, paired=TRUE, parallel=TRUE)
The "parallel" parameter is also available for the addCNV function. Note, that for parallel scanning of files, reading speed might be a limiting factor.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.