#' @title Wrapper function running the smoothed-EM algorithm to estimate
#' covariate effects and test regional association in Bisulfite
#' Sequencing-derived methylation data
#' @description This function splits the methylation data into regions
#' (according to different approaches) and, for each region, fits a
#' (dispersion-adjusted) binomial regression model to regional methylation data,
#' and reports the estimated smooth covariate effects and regional p-values
#' for the test of DMRs (differentially methylation regions). Over or under
#' dispersion across loci is accounted for in the model by the combination
#' of a multiplicative dispersion parameter (or scale parameter) and a
#' sample-specific random effect.
#' @description This method can deal with outcomes, i.e. the number of
#' methylated reads in a region, that are contaminated by known
#' false methylation calling rate (\code{p0}) and false non-methylation
#' calling rate (\code{1-p1}).
#' @description The covariate effects are assumed to smoothly vary across
#' genomic regions. In order to estimate them, the algorithm first
#' represents the functional parameters by a linear combination
#' of a set of restricted cubic splines (with dimension
#' \code{n.k}), and a smoothness penalization term
#' which depends on the smoothing parameters \code{lambdas} is also
#' added to control smoothness. The estimation is performed by an iterated
#' EM algorithm. Each M step constitutes an outer Newton's iteration to estimate
#' smoothing parameters \code{lambdas} and an inner P-IRLS iteration to estimate
#' spline coefficients \code{alpha} for the covariate effects.
#' Currently, the computation in the M step depends the implementation of
#' \code{gam()} in package \code{mgcv}.
#' @param dat a data frame with rows as individual CpGs appearing
#' in all the samples. The first 4 columns should contain the information of
#' `Meth_Counts` (methylated counts), `Total_Counts` (read depths),
#' `Position` (Genomic position for the CpG site) and `ID` (sample ID).
#' The covariate information, such as disease status or cell type composition,
#' are listed in column 5 and onwards.
#' @param split this \code{list} must contain at least the element
#' \code{approach} which corresponds to the partitioning approach used to split
#' the data into independent regions. The partitioning methods
#' available are:
#' \itemize{
#' \item \code{"region"} (partitioning based on the spacing of CpGs),
#' \item \code{"density"} (partitioning based on CpG density),
#' \item \code{"chromatin"} (partitioning based on chromatin states),
#' \item \code{"gene"} (partitioning based on gene regions),
#' \item \code{"granges"} (partitioning based on user-specific
#' annotations provided as a \code{GenomicRanges} object),
#' \item \code{"bed"} (partitioning based on user-specific
#' annotations provided in a BED file).
#' }
#' This list should also contain additional parameters specific to each
#' partitioning approach (see the documentation of each approach for details).
#' @param min.cpgs positive integer defining the minimum number of
#' CpGs within a region for the algorithm to perform optimally.
#' The default value is 50.
#' @param max.cpgs positive integer defining the maximum number of
#' CpGs within a region for the algorithm to perform optimally.
#' The default value is 2000.
#' @param n.k a vector of basis dimensions for the intercept and
#' individual covariates. \code{n.k} specifies an upper limit of the degrees of
#' each functional parameters. The length of n.k should equal to the number of
#' covariates plus 1 (for the intercept)). We recommend basis dimensions n.k,
#' approximately equal to the number of unique CpGs in the region divided by 20.
#' This parameter will be computed automatically, when several regions are
#' generated by the partitioning function.
#' @param p0 the probability of observing a methylated read when the underlying
#' true status is unmethylated. \code{p0} is the rate of false methylation
#' calls, i.e. false positive rate.
#' @param p1 the probability of observing a methylated read when the underlying
#' true status is methylated. \code{1-p1} is the rate of false non-methylation
#' calls, i.e. false negative rate.
#' @param covs a vector of covariate names. The covariates with names in
#' \code{covs} will be included in the model and their covariate effects will be
#' estimated. The default is to fit all covariates in \code{dat}
#' @param Quasi whether a Quasi-likelihood estimation approach will be used;
#' in other words, whether a multiplicative dispersion is added in the model
#' or not.
#' @param epsilon numeric; stopping criterion for the closeness of estimates of
#' spline coefficients from two consecutive iterations.
#' @param epsilon.lambda numeric; stopping criterion for the closeness of
#' estimates of smoothing parameter \code{lambda} from two consecutive
#' iterations.
#' @param maxStep the algorithm will step if the iteration steps exceed
#' \code{maxStep}.
#' @param binom.link the link function used in the binomial regression model;
#' the default is the logit link.
#' @param method the method used to estimate the smoothing
#' parameters. The default is the 'REML' method which is generally better than
#' prediction based criterion \code{GCV.cp}.
#' @param RanEff whether sample-level random effects are added or not
#' @param reml.scale whether a REML-based scale (dispersion) estimator is used.
#' The default is Fletcher-based estimator.
#' @param scale negative values mean scale parameter should be estimated; if a
#' positive value is provided, a fixed scale will be used.
#' @param verbose logical indicates if the algorithm should provide progress
#' report information.
#' The default value is TRUE.
#' @return This function returns a \code{list} of models (one by independent
#' region) including objects:
#' \itemize{
#' \item \code{est}: estimates of the spline basis coefficients \code{alpha}
#' \item \code{lambda}: estimates of the smoothing parameters for each
#' functional parameters
#' \item \code{est.pi}: predicted methylation levels for each row in the input
#' \code{data}
#' \item \code{ite.points}: estimates of \code{est}, \code{lambda} at each EM
#' iteration
#' \item \code{cov1}: estimated variance-covariance matrix of the basis
#' coefficients \code{alphas}
#' \item \code{reg.out}: regional testing output obtained using Fletcher-based
#' dispersion estimate; an additional 'ID' row would appear if RanEff is TRUE
#' \item \code{reg.out.reml.scale}: regional testing output obtained using
#' REML-based dispersion estimate;
#' \item \code{reg.out.gam}: regional testing output obtained using
#' (Fletcher-based) dispersion estimate from mgcv package;
#' \item \code{phi_fletcher}: Fletcher-based estimate of the (multiplicative)
#' dispersion parameter;
#' \item \code{phi_reml}: REML-based estimate of the (multiplicative)
#' dispersion parameter;
#' \item \code{phi_gam}: Estimated dispersion parameter reported by mgcv;
#' \item \code{SE.out}: a matrix of the estimated pointwise Standard Errors
#' (SE); number of rows are the number of unique CpG sites in the input data and
#' the number of columns equal to the total number of covariates fitted in the
#' model (the first one is the intercept);
#' \item \code{SE.out.REML.scale}: a matrix of the estimated pointwise Standard
#' Errors (SE); the SE calculated from the REML-based dispersion estimates
#' \item \code{uni.pos}: the genomic postions for each row of CpG sites in the
#' matrix \code{SE.out};
#' \item \code{Beta.out}: a matrix of the estimated covariate effects beta(t),
#' where t denotes the genomic positions;
#' \item \code{ncovs}: number of functional paramters in the model (including
#' the intercept);
#' \item \code{sigma00}: estimated variance for the random effect if RanEff is
#' TRUE; NA if RanEff is FALSE.
#' }
#' @author Audrey Lemaçon
#' @examples
#' #------------------------------------------------------------#
#' data(RAdat)
#' RAdat.f <- na.omit(RAdat[RAdat$Total_Counts != 0, ])
#' outs <- runSOMNiBUS(
#' dat=RAdat.f, split = list(approach = "region", gap = 1e6), min.cpgs = 5,
#' n.k = rep(5,3), p0 = 0.003, p1 = 0.9
#' )
#'
#' @export
runSOMNiBUS <- function(dat, split = list(approach = "region"),
min.cpgs = 50, max.cpgs = 2000,
n.k, p0 = 0.003, p1 = 0.9, Quasi = TRUE,
epsilon = 10^(-6), epsilon.lambda = 10^(-3),
maxStep = 200, binom.link = "logit", method = "REML",
covs = NULL, RanEff = TRUE,
reml.scale = FALSE, scale = -2, verbose = TRUE) {
t0 <- Sys.time()
msg <- paste("Processing methylation data.")
if(verbose) Message(msg)
dat <- data.frame(dat)
if (is.factor(dat$Position)) {
## message('The Position in the data set should be numeric other than a
## factor')
dat$Position <- as.numeric(as.character(dat$Position))
}
if (any(!c("Meth_Counts", "Total_Counts",
"Position", "ID") %in% colnames(dat))) {
Error(paste("Please make sure object \"dat\" have",
"columns named as \"Meth_Counts\",",
"\"Total_Counts\", \"Position\"",
"and \"ID\" "))
}
# filter out "bad data"
bad_dat_idx <- which(dat$Total_Counts == 0)
if(length(bad_dat_idx) > 0){
dat <- dat[-bad_dat_idx,]
msg <- paste("Remove",length(bad_dat_idx),
"rows with Total_Counts equal to 0.")
if(verbose) Message(msg)
}
## test if the split list contains the 'approach' exists
if(!"approach" %in% names(split)){
Warning(paste("Missing partitioning method. We will used",
"the default approach \"region\"."))
split$approach <- "region"
}
regions <- switch(split$approach,
region = splitDataByRegion(dat = dat,
gap = split$gap,
min.cpgs = min.cpgs,
max.cpgs = max.cpgs,
verbose = verbose),
density = splitDataByDensity(dat = dat,
window.size = split$window.size,
min.density = split$min.density,
by = split$by,
gap = split$gap,
min.cpgs = min.cpgs,
max.cpgs = max.cpgs,
verbose = verbose),
chromatin = splitDataByChromatin(dat = dat,
chr = split$chr,
cell.line = split$cell.line,
states = split$states,
gap = split$gap,
min.cpgs = min.cpgs,
max.cpgs = max.cpgs,
verbose = verbose),
gene = splitDataByGene(dat = dat,
chr = split$chr,
organism = split$organism,
build = split$build,
types = split$types,
gap = split$gap,
min.cpgs = min.cpgs,
max.cpgs = max.cpgs,
verbose = verbose),
granges = splitDataByGRanges(dat = dat,
chr = split$chr,
annots = split$annots,
gap = split$gap,
min.cpgs = min.cpgs,
max.cpgs = max.cpgs,
verbose = verbose),
bed = splitDataByBed(dat = dat,
chr = split$chr,
bed = split$bed,
gap = split$gap,
min.cpgs = min.cpgs,
max.cpgs = max.cpgs,
verbose = verbose),
{
Warning(paste("Unknown partitioning",
"method. We will used",
"the default approach",
"\"region\"."))
splitDataByRegion(dat = dat, min.cpgs = min.cpgs)
}
)
msg <- paste("Input partitioned into",length(regions),
"independent region(s)")
if(verbose) Message(msg, step = NULL)
results <- lapply(X = names(regions), function(n){
region <- regions[[n]]
msg <- paste("Starting analysis for",n,
paste0("(",length(unique(region$Position))," unique CpGs)"))
if(verbose) Message(msg, step = NULL)
if(length(regions) > 1){
n.k_dim1 <- max(1, round(length(unique(region$Position))/20))
n.k_dim2 <- ncol(region) - 3
n.k <- rep(n.k_dim1, n.k_dim2)
msg <- paste0("n.k parameter set to rep(", n.k_dim1,",",n.k_dim2,")")
if(verbose) Message(msg, step = NULL)
}
binomRegMethModel(data = region, n.k = n.k, p0 = p0, p1 = p1,
Quasi = Quasi, epsilon = epsilon,
epsilon.lambda = epsilon.lambda,
maxStep = maxStep, binom.link = binom.link,
method = method, covs = covs, RanEff = RanEff,
reml.scale = reml.scale, scale = scale, verbose = verbose)
})
names(results) <- names(regions)
msg <- paste("Process completed in",
format(Sys.time() - t0, digits = 2))
if(verbose) Message(msg, step = "Finished")
return(results)
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.