knitr::opts_chunk$set(echo = TRUE, cache = FALSE)
methcp
is a differentially methylated region
(DMR) detecting method for whole-genome bisulfite sequencing (WGBS)
data. It is applicable for a wide range of experimental designs.
In this document, we provide examples for two-group comparisons and
time-course analysis.
methcp
identifies DMRs based on change point detection, which
naturally segments the genome and provides region-level
differential analysis. We direct the interested reader to our paper
here
Load packages.
library(bsseq) library(MethCP)
In this section, we use the CpG methylation data from an Arabidopsis dataset
available from GEO with accession number
GSM954584. We
take a subset of chromosome 1 and 2 from each of the samples and perform
differential analysis between the wild-type plants and the H2A.Z mutant
plants, which we refer to as treatment
and control
in the rest of
the document.
We use a well-developed Bioconductor package bsseq
to load the store the raw
data. Below is an example of how to read raw counts using bsseq
.
We provide a helper function createBsseqObject
to create a bsseq object
when the data for each sample is stored in a separate text file.
For more operations regarding the bsseq
object, or to create a bsseq
object customized to your file format, please refer to their
User's Guide.
# The dataset is consist of 6 samples. 3 samples are H2A.Z mutant # plants, and 3 samples are controls. sample_names <- c( paste0("control", seq_len(3)), paste0("treatment", seq_len(3)) ) # Get the vector of file path and names raw_files <- system.file( "extdata", paste0(sample_names, ".txt"), package = "MethCP") # load the data bs_object <- createBsseqObject( files = raw_files, sample_names = sample_names, chr_col = 'Chr', pos_col = 'Pos', m_col = "M", cov_col = 'Cov')
The bsseq
object.
bs_object
Header of the raw file for one of the samples.
dt <- read.table( raw_files[1], stringsAsFactors = FALSE, header = TRUE) head(dt)
We calculate the per-cytosine statistics using two different test DSS
and
methylKit
using the function calcLociStat
. This function returns a
MethCP
object that is used for the future segmentation step. We allow
parallelized computing when there are multiple chromosomes in the dataset.
# the sample names of the two groups to compare. They should be subsets of the # sample names provided when creating the `bsseq` objects. group1 <- paste0("control", seq_len(3)) group2 <- paste0("treatment", seq_len(3)) # Below we calculate the per-cytosine statistics using two different # test `DSS` and `methylKit`. The users may pick one of the two for their # application. # obj_DSS <- calcLociStat(bs_object, group1, group2, test = "DSS") obj_methylKit <- calcLociStat( bs_object, group1, group2, test = "methylKit")
obj_methylKit
In cases the user wants to use their pre-calculated test statistics for
experiments other than two-group comparison and time course data, we use the
calculated statistics and create a MethCP
object.
data <- data.frame( chr = rep("Chr01", 5), pos = c(2, 5, 9, 10, 18), effect.size = c(1,-1, NA, 9, Inf), pvals = c(0, 0.1, 0.9, NA, 0.02)) obj <- MethCPFromStat( data, test.name="myTest", pvals.field = "pvals", effect.size.field="effect.size", seqnames.field="chr", pos.field="pos" )
obj
segmentMethCP
performs segmentation on a MethCP
object. We allow
parallelized computing when there are multiple chromosomes in the dataset.
Different from calcLociStat
function in the previous section, we do not put
any constraint on the number of cores used. Please see the documentation for
adjusting the parameters used in the segmentation.
# obj_DSS <- segmentMethCP( # obj_DSS, bs_object, region.test = "weighted-coverage") obj_methylKit <- segmentMethCP( obj_methylKit, bs_object, region.test = "fisher")
obj_methylKit
Use function getSigRegion
on a MethCP
object to get the list of DMRs.
# region_DSS <- getSigRegion(obj_DSS) # head(region_DSS)
region_methylKit <- getSigRegion(obj_methylKit) head(region_methylKit)
MethCP is flexible for a wide variety of experimental designs. We apply MethCP on an Arabidopsis thaliana seed germination dataset available from the GEO with accession number https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE94712 The data is generated from two replicates of dry seed and germinating seeds of wild-type plants (Col-0) and ros1 dml2 dml3 (rdd) triple demethylase mutant plants at 0-4 days after imbibition for 4 days (DAI). For cytosine-based statistics, we fit linear models on the methylation ratios and test the differences between the time coefficient of condition Col-0 and condition rdd.
Read the meta data.
meta_file <- system.file( "extdata", "meta_data.txt", package = "MethCP") meta <- read.table(meta_file, sep = "\t", header = TRUE) head(meta)
Read the counts data.
# Get the vector of file path and names raw_files <- system.file( "extdata", paste0(meta$SampleName, ".tsv"), package = "MethCP") # read files bs_object <- createBsseqObject( files = raw_files, sample_names = meta$SampleName, chr_col = 1, pos_col = 2, m_col = 4, cov_col = 5, header = TRUE)
Apply coverage filter to make sure each loci has total coverage (summed across samples) more than 3 for each condition.
groups <- split(seq_len(nrow(meta)), meta$Condition) coverages <- as.data.frame(getCoverage(bs_object, type = "Cov")) filter <- rowSums(coverages[, meta$SampleName[groups[[1]]]] != 0) >= 3 & rowSums(coverages[, meta$SampleName[groups[[2]]]] != 0) >= 3 bs_object <- bs_object[filter, ]
Calculate the statistics. A dataframe of the meta data will be passed to
function calcLociStatTimeCourse
. Note that there must be columns named
Condition
, Time
and SampleName
in the dataframe.
obj <- calcLociStatTimeCourse(bs_object, meta)
obj
Segmentation.
obj <- segmentMethCP(obj, bs_object, region.test = "stouffer")
Get the DMRs.
regions <- getSigRegion(obj)
head(regions)
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.