#' @title Repetitive element methylation prediction
#'
#' @description
#' \code{remp} is used to predict genomewide methylation levels of locus-specific repetitive elements (RE).
#' Two major RE types in human, Alu element (Alu) and LINE-1 (L1) are available.
#'
#' @param methyDat A \code{\link{RatioSet}}, \code{\link{GenomicRatioSet}}, \code{\link{DataFrame}},
#' \code{data.table}, \code{data.frame}, or \code{matrix} of Illumina BeadChip methylation data
#' (450k or EPIC array) or Illumina methylation percentage estimates by sequencing. See Details.
#' Alternatively, user can also specify a pre-built data template (see \code{\link{rempTemplate}}).
#' \code{remp} to carry out the prediction. See \code{\link{rempTemplate}}. With template specified, \code{methyDat},
#' \code{REtype}, \code{parcel}, and \code{work.dir} can be skipped.
#' @param REtype Type of RE. Currently \code{"Alu"}, \code{"L1"}, and \code{"ERV"} are supported. If \code{NULL},
#' the type of RE will be extracted from \code{parcel}.
#' @param Seq.GR A \code{\link{GRanges}} object containing genomic locations of the CpGs profiled by sequencing
#' platforms. This parameter should not be \code{NULL} if the input methylation data \code{methyDat} are
#' obtained by sequencing. Note that the genomic location can be in either hg19 or hg38 build.
#' See details in \code{\link{initREMP}}.
#' @param parcel An \code{\link{REMParcel}} object containing necessary data to carry out the
#' prediction. If \code{NULL}, \code{REtype} must specify a type of RE so that the function can search the
#' \code{.rds} data file in \code{work.dir} exported by \code{\link{initREMP}} (with \code{export = TRUE})
#' or \code{\link{saveParcel}}.
#' @param work.dir Path to the directory where the annotation data generated by \code{\link{initREMP}}
#' are saved. Valid when the argument \code{parcel} is missing. If not specified, temporary directory
#' \code{tempdir()} will be used. If specified, the directory path has to be the same as the
#' one specified in \code{\link{initREMP}} or in \code{\link{saveParcel}}.
#' @param win An integer specifying window size to confine the upstream and downstream flanking
#' region centered on the predicted CpG in RE for prediction. Default = \code{1000}. See Details.
#' @param method Name of model/approach for prediction. Currently \code{"rf"} (Random Forest),
#' \code{"xgbTree"} (Extreme Gradient Boosting), \code{"svmLinear"} (SVM with linear kernel), \code{"svmRadial"}
#' (SVM with radial kernel), and \code{"naive"} (carrying over methylation values of the closest
#' CpG site) are available. Default = \code{"rf"} (Random Forest). See Details.
#' @param autoTune Logical parameter. If \code{TRUE}, a 3-time repeated 5-fold cross validation
#' will be performed to determine the best model parameter. If \code{FALSE}, the \code{param} option
#' (see below) must be specified. Default = \code{TRUE}. Auto-tune will be disabled using Random Forest.
#' See Details.
#' @param param A list specifying fixed model tuning parameter(s) (not applicable for Random Forest, see Details).
#' For Extreme Gradient Boosting, \code{param} list must contain '\code{$nrounds}', '\code{$max_depth}',
#' '\code{$eta}', '\code{$gamma}', '\code{$colsample_bytree}', '\code{$min_child_weight}', and '\code{$subsample}'.
#' See \code{xgbTree} in package \code{caret}. For SVM, \code{param} list must contain '\code{$C}' (cost) for linear kernel
#' or '\code{$sigma}' and '\code{$C}' for radial basis function kernel. This parameter is valid only
#' when \code{autoTune = FALSE}.
#' @param seed Random seed for Random Forest model for reproducible prediction results.
#' Default is \code{NULL}, which generates a seed.
#' @param ncore Number of cores used for parallel computing. By default, max number of cores available
#' in the machine will be utilized. If \code{ncore = 1}, no parallel computing is allowed.
#' @param BPPARAM An optional \code{\link{BiocParallelParam}} instance determining the parallel back-end to
#' be used during evaluation. If not specified, default back-end in the machine will be used.
#' @param verbose Logical parameter. Should the function be verbose?
#'
#' @details
#' Before running \code{remp}, user should make sure the methylation data have gone through
#' proper quality control, background correction, and normalization procedures. Both beta value
#' and M value are allowed. Rows represents probes and columns represents samples. For array data,
#' please make sure to have row names that specify the Illumina probe ID (i.e. cg00000029). For sequencing
#' data, please provide the genomic location of CpGs in a \code{\link{GRanges}} obejct and
#' specify it using \code{Seq.GR} parameter. \code{win = 1000} is based on previous findings showing that
#' neighboring CpGs are more likely to be co-modified within 1000 bp. User can specify narrower window size
#' for slight improvement of prediction accuracy at the cost of less predicted RE. Window size greater than 1000 is not
#' recommended as the machine learning models would not be able to learn much userful information
#' for prediction but introduce noise. Random Forest model (\code{method = "rf"}) is recommented
#' as it offers more accurate prediction and it also enables prediction reliability functionality.
#' Prediction reliability is estimated by conditional standard deviation using Quantile Regression Forest.
#' Please note that if parallel computing is allowed, parallel Random Forest
#' (powered by package \code{\link{ranger}}) will be used automatically. The performance of
#' Random Forest model is often relatively insensitive to the choice of \code{mtry}.
#' Therefore, auto-tune will be turned off using Random Forest and \code{mtry} will be set to one third
#' of the total number of predictors. For SVM, if \code{autoTune = TRUE}, preset tuning parameter
#' search grid can be access and modified using \code{\link{remp_options}}.
#'
#' @return A \code{\link{REMProduct}} object containing predicted RE methylation results.
#'
#' @seealso See \code{\link{initREMP}} to prepare necessary annotation database before running \code{remp}.
#'
#' @examples
#' # Obtain example Illumina example data (450k)
#' if (!exists("GM12878_450k"))
#' GM12878_450k <- getGM12878("450k")
#'
#' # Make sure you have run 'initREMP' first. See ?initREMP.
#' if (!exists("remparcel")) {
#' data(Alu.hg19.demo)
#' remparcel <- initREMP(arrayType = "450k",
#' REtype = "Alu",
#' annotation.source = "AH",
#' genome = "hg19",
#' RE = Alu.hg19.demo,
#' ncore = 1,
#' verbose = TRUE)
#' }
#'
#' # With data template pre-built. See ?rempTemplate.
#' if (!exists("template"))
#' template <- rempTemplate(GM12878_450k,
#' parcel = remparcel,
#' win = 1000,
#' verbose = TRUE)
#'
#' # Run remp with pre-built template:
#' remp.res <- remp(template, ncore = 1)
#'
#' # Or run remp without pre-built template (identical results):
#' \dontrun{
#' remp.res <- remp(GM12878_450k,
#' REtype = "Alu",
#' parcel = remparcel,
#' ncore = 1,
#' verbose = TRUE)
#' }
#'
#' remp.res
#' details(remp.res)
#' rempB(remp.res) # Methylation data (beta value)
#'
#' # Extract CpG location information.
#' # This accessor is inherit from class 'RangedSummarizedExperiment')
#' rowRanges(remp.res)
#'
#' # RE annotation information
#' rempAnnot(remp.res)
#'
#' # Add gene annotation
#' remp.res <- decodeAnnot(remp.res, type = "symbol")
#' rempAnnot(remp.res)
#'
#' # (Recommended) Trim off less reliable prediction
#' remp.res <- rempTrim(remp.res)
#'
#' # Obtain RE-level methylation (aggregate by mean)
#' remp.res <- rempAggregate(remp.res)
#' rempB(remp.res) # Methylation data (beta value)
#'
#' # Extract RE location information
#' rowRanges(remp.res)
#'
#' # Density plot across predicted RE
#' remplot(remp.res)
#'
#' @export
remp <- function(methyDat = NULL,
REtype = c("Alu", "L1", "ERV"),
Seq.GR = NULL,
parcel = NULL,
work.dir = tempdir(),
win = 1000,
method = c("rf", "xgbTree", "svmLinear", "svmRadial", "naive"),
autoTune = TRUE,
param = NULL,
seed = NULL,
ncore = NULL,
BPPARAM = NULL,
verbose = FALSE) {
currenT <- Sys.time()
if (is.null(methyDat)) stop("Methylation dataset (methyDat) is missing.")
if (!is.null(Seq.GR)) {
.isGROrStop(Seq.GR)
names(Seq.GR) <- NULL
}
if (!is.null(param) & !is(param, "list")) stop("Tuning parameter(s) (param) must be a list object.")
method <- match.arg(method)
if (method != "naive" & !autoTune & is.null(param)) {
message("Tuning parameter for method = '", method, "' is missing!")
stop("You turned off auto-tune. Please specify tuning parameter(s).")
}
if (method %in% c("rf", "xgbTree")) {
if (is.null(seed)) {
seed <- sample(.Random.seed, 1)
}
message("Using random seed = ", seed)
if (method == "rf") {
autoTune <- FALSE # No need to tune RF mtry
param <- list(mtry = round(length(remp_options(".default.predictors")) / 3)) # one third of number of predictors
if (verbose) {
message("Note: Auto-tune is disabled for Random Forest and mtry is set to
one third of number of predictors (", param$mtry, ").")
}
}
}
if (is.null(ncore)) {
ncore <- 1
}
if (!is(methyDat, "template")) {
if (win > remp_options(".default.max.flankWindow")) {
stop(
"Flanking window size cannot be greater than ",
remp_options(".default.max.flankWindow"),
". Please see ?remp_options for details."
)
}
## Groom methylation data
methyDat <- grooMethy(methyDat, Seq.GR, verbose = verbose)
arrayType <- gsub("IlluminaHumanMethylation", "", methyDat@annotation["array"])
methyDat <- minfi::getM(methyDat)
## Get the REMParcel object ready
if (is.null(parcel)) {
if (is.null(REtype)) {
stop("REtype and parcel parameters cannot be both empty!")
} else {
REtype <- match.arg(REtype)
}
subDirName <- paste0("REMP.data.", arrayType)
work.dir <- .forwardSlashPath(work.dir)
data.dir <- .forwardSlashPath(file.path(work.dir, subDirName))
path_to_parcel <- file.path(
data.dir,
paste0(
"REMParcel_",
REtype, "_",
arrayType, "_",
remp_options(".default.max.flankWindow"),
".rds"
)
)
if (file.exists(path_to_parcel)) {
remparcel <- readRDS(path_to_parcel)
} else {
stop("Necessary annotation data files are missing. Please make sure:
1) you have run initREMP() first to generate the annotation data. See ?initREMP for details;
2) the RE type (Alu/L1) and/or platform (450k/EPIC/Sequencing) match the annotation data generated by initREMP().
3) parameter 'parcel' is speficied, if the generated annotation data were not exported to local machine;
4) if exported, your working directory specified is the same as you did in initREMP().")
}
} else {
.isREMParcelOrStop(parcel)
if (getParcelInfo(parcel)$platform != arrayType) {
stop(paste0(
"The methylation data type: ", arrayType,
" does not match the REMParcel type: ",
getParcelInfo(parcel)$platform
))
}
remparcel <- parcel
}
TEMPLATE <- rempTemplate(methyDat, Seq.GR, remparcel, win, FALSE)
} else {
TEMPLATE <- methyDat
}
## Setup backend for paralell computing
be <- getBackend(ncore, BPPARAM, verbose)
ncore <- BiocParallel::bpnworkers(be)
message(
"Start RE methylation prediction with ",
BiocParallel::bpnworkers(be), " core(s) ..."
)
REtype <- TEMPLATE$REtype # make sure REtype is consistent with REMParcel
genome <- TEMPLATE$genome
arrayType <- TEMPLATE$arrayType
methyDat <- TEMPLATE$NBCpG_methyDat
RE_NeibCpG <- TEMPLATE$NBCpG_GR
RE_CpG_ILMN_DATA <- TEMPLATE$RECpG_methyDat
RE_CpG_ILMN <- TEMPLATE$RECpG_GR
refgene_main <- TEMPLATE$refgene
RE_refGene.original <- TEMPLATE$RE
samplenames <- colnames(methyDat)
sampleN <- length(samplenames)
#######################################
if ("naive" == method) {
BEST_TUNE <- DataFrame(matrix(NA, nrow = sampleN, ncol = 1))
colnames(BEST_TUNE) <- "Not_Applicable"
libToLoad <- NULL
method_text <- "Naive (nearest CpG)"
QCname_text <- "N/A"
}
if ("rf" == method) {
BEST_TUNE <- DataFrame(matrix(NA, nrow = sampleN, ncol = 1))
colnames(BEST_TUNE) <- "mtry"
libToLoad <- NULL
method_text <- "Random Forest"
QCname_text <- "Quantile Regression Forest"
}
if ("xgbTree" == method) {
BEST_TUNE <- DataFrame(matrix(NA, nrow = sampleN, ncol = 7))
colnames(BEST_TUNE) <- c("nrounds", "max_depth", "eta", "gamma", "colsample_bytree", "min_child_weight", "subsample")
libToLoad <- NULL
method_text <- "Extreme Gradient Boosting"
QCname_text <- "N/A"
}
if ("svmLinear" == method) {
BEST_TUNE <- DataFrame(matrix(NA, nrow = sampleN, ncol = 1))
colnames(BEST_TUNE) <- "cost"
libToLoad <- c("kernlab")
method_text <- "SVM(Linear)"
QCname_text <- "N/A"
}
if ("svmRadial" == method) {
BEST_TUNE <- DataFrame(matrix(NA, nrow = sampleN, ncol = 2))
colnames(BEST_TUNE) <- c("sigma", "cost")
libToLoad <- c("kernlab")
method_text <- "SVM(Radial)"
QCname_text <- "N/A"
}
rownames(BEST_TUNE) <- samplenames
if (verbose) {
message("Prediction model: ", method_text, "\nQC model: ", QCname_text)
}
cpgRanges <- unique(RE_NeibCpG[, "RE.Index"])
RE_annotation <- subsetByOverlaps(RE_refGene.original, cpgRanges)
# Note: Some REs have overlapping regions thus there can be more than one REs mapped to one RE-CpG.
RE_annotation_name <- colnames(mcols(RE_annotation))
regionCode <- mcols(RE_annotation)[remp_options(".default.genomicRegionColNames")]
RE_annotation <- RE_annotation[, RE_annotation_name[!RE_annotation_name %in%
remp_options(".default.genomicRegionColNames")]]
# Initiate container
REMP_PREDICT_CpG <- matrix(NA, nrow = length(cpgRanges), ncol = sampleN)
REMP_PREDICT_QC <- matrix(NA, nrow = length(cpgRanges), ncol = sampleN)
REMP_PREDICT_IMP <- setNames(DataFrame(matrix(NA,
nrow = length(remp_options(".default.predictors")),
ncol = sampleN
)), samplenames)
rownames(REMP_PREDICT_IMP) <- remp_options(".default.predictors")
## RE coverage
RE_COVERAGE <- .coverageStats_RE(RE_annotation, regionCode, cpgRanges, RE_CpG_ILMN,
REtype,
indent = " ", verbose
)
# Gene coverage
GENE_COVERAGE <- .coverageStats_GENE(regionCode, refgene_main,
REtype,
indent = " ", verbose
)
## Prediction
tRun <- .timeTrace(currenT)
currenT <- tRun$t
for (i in seq_len(sampleN)) {
message("Predicting sample ", samplenames[i], " ...")
RE_CpG_ILMN_OneMethyDat <- RE_CpG_ILMN_DATA[, samplenames[i]]
# identical(RE_CpG_ILMN$Index, names(RE_CpG_ILMN_OneMethyDat))
### To be predicted
if(verbose) message(" Loading methylation data for prediction...")
RE_unprf_neib <- .loadMethy(methyDat[, i, drop = FALSE], RE_NeibCpG)
# summary(RE_unprf_neib$distance)
### For model training
RE_unprf_neib <- unique(RE_unprf_neib[, c("RE.Index", remp_options(".default.predictors"))])
HITS <- findOverlaps(RE_unprf_neib, RE_CpG_ILMN, ignore.strand = TRUE)
RE_prf_neib <- RE_unprf_neib[queryHits(HITS), ]
RE_prf_neib$Methy <- RE_CpG_ILMN_OneMethyDat[subjectHits(HITS)]
# summary(RE_prf_neib$distance.min2)
# cor(RE_prf_neib$Methy.min, RE_prf_neib$Methy)
P_basic <- .modelTrain(
RE_prf_neib, remp_options(".default.predictors"), method,
autoTune, param, be, seed, verbose
)
best_tune <- as.numeric(P_basic$bestTune)
if (method != "naive") {
REMP_PREDICT_IMP[, i] <- P_basic$importance[remp_options(".default.predictors"), "Overall"]
for (k in seq_len(length(best_tune)))
{
BEST_TUNE[i, k] <- best_tune[k]
}
}
##############################################################################################
newdata <- as.matrix(mcols(RE_unprf_neib)[, remp_options(".default.predictors")])
if (method != "rf") {
if (ncore > 1) {
BiocParallel::bpstart(be)
.bploadLibraryQuiet(libToLoad, be)
ITER <- .iblkrow(newdata, chunks = ncore)
REMP_PREDICT_CpG[, i] <- BiocParallel::bpiterate(ITER, .predictREMP,
model = P_basic$model,
BPPARAM = be,
REDUCE = c,
reduce.in.order = TRUE
)
BiocParallel::bpstop(be)
} else {
REMP_PREDICT_CpG[, i] <- .predictREMP(newdata, P_basic$model)
}
REMP_PREDICT_QC <- NULL
} else {
REMP_PREDICT_CpG[, i] <- predict(P_basic$model, newdata,
type = "response",
num.threads = ncore
)$predictions
REMP_PREDICT_QC[, i] <- .QTF(
rangerObj = P_basic$model,
x = as.matrix(mcols(RE_prf_neib)[, remp_options(".default.predictors")]),
y = as.numeric(mcols(RE_prf_neib)[, "Methy"]),
newX = newdata,
seed = seed,
num.threads = ncore
)
}
tRun <- .timeTrace(currenT)
currenT <- tRun$t
message(
" ", samplenames[i], " completed! ", sampleN - i,
" sample(s) left ...", tRun$t_text
)
}
message("Done.")
remproduct <- REMProduct(
REtype = REtype, genome = genome,
platform = arrayType, win = as.character(win),
predictModel = method_text, QCModel = QCname_text,
rempM = REMP_PREDICT_CpG, rempQC = REMP_PREDICT_QC,
cpgRanges = cpgRanges, sampleInfo = BEST_TUNE,
REannotation = RE_annotation,
RECpG = RE_CpG_ILMN,
regionCode = regionCode,
refGene = refgene_main,
varImp = REMP_PREDICT_IMP,
REStats = RE_COVERAGE, GeneStats = GENE_COVERAGE,
Seed = seed
)
return(remproduct)
} ## End of remp
## --------------------------------------
## Internal functions
.toIndicator <- function(object) {
genomicRegionInd <- matrix(0L,
nrow = length(object),
ncol = length(remp_options(".default.genomicRegionColNames"))
)
genomicRegionInd[!is.na(mcols(object)[, remp_options(".default.genomicRegionColNames")])] <- 1L
genomicRegionInd <- DataFrame(genomicRegionInd)
colnames(genomicRegionInd) <- remp_options(".default.genomicRegionColNames")
mcols(object)[, remp_options(".default.genomicRegionColNames")] <- genomicRegionInd
return(object)
}
.loadMethy <- function(methyDatOne, RE_NeibCpG) {
RE_NeibCpG$Methy.all <- methyDatOne[RE_NeibCpG$Methy.ptr, ]
RE_NeibCpG$Methy.min <- methyDatOne[RE_NeibCpG$Methy.ptr.min, ]
RE_NeibCpG$Methy.min2 <- methyDatOne[RE_NeibCpG$Methy.ptr.min2, ]
RE_NeibCpG.DF <- mcols(RE_NeibCpG)[, c("RE.CpG.ID", "distance_cat", "Methy.all")]
catList <- sort(unique(RE_NeibCpG.DF$distance_cat))
Methy_all_agg <- .aggregateNeib(
Methy.all ~ RE.CpG.ID, RE_NeibCpG.DF,
function(x) sd(x),
c("RE.CpG.ID", "Methy.std")
)
## Moving average
cati <- NULL
for (i in catList)
{
cati <- c(cati, i)
sub_RE_NeibCpG.DF <- RE_NeibCpG.DF[RE_NeibCpG.DF$distance_cat %in% cati, ]
sub_Methy_all_agg <- .aggregateNeib(
Methy.all ~ RE.CpG.ID,
sub_RE_NeibCpG.DF,
function(x) mean(x),
c("RE.CpG.ID", paste0("Methy.mean.mov", i))
)
Methy_all_agg <- merge(Methy_all_agg, sub_Methy_all_agg,
by = "RE.CpG.ID", all.x = TRUE
)
}
## carry over missing
for (j in catList + 2)
{
current <- Methy_all_agg[, j]
k <- 1
while (any(is.na(current)) & j + k <= 6) {
current[is.na(current)] <- Methy_all_agg[is.na(current), j + k]
k <- k + 1
}
Methy_all_agg[, j] <- current
}
RE_NeibCpG_meta <- mcols(RE_NeibCpG)
mcols(RE_NeibCpG) <- cbind(
RE_NeibCpG_meta,
Methy_all_agg[base::match(
RE_NeibCpG_meta$RE.CpG.ID,
Methy_all_agg$RE.CpG.ID
), -1]
)
return(RE_NeibCpG)
}
.aggregateNeib <- function(relation, d, FUN, newColNames) {
d_agg <- aggregate(relation, data = d, FUN)
d_agg <- DataFrame(d_agg[, 1], DataFrame(d_agg[, -1]))
return(setNames(d_agg, newColNames))
}
.modelTrain <- function(d, varname, method, autoTune, param,
be, seed, verbose) {
ncore <- BiocParallel::bpnworkers(be)
d <- as.data.frame(mcols(d)[, c("Methy", varname)])
if (autoTune) {
trC.method <- "repeatedcv"
trC.number <- 5
trC.repeats <- 3
} else {
trC.method <- "none"
trC.number <- 1
trC.repeats <- NA
}
if (method == "naive") {
modelFit <- list(
model = NULL, QC = NULL, importance = NULL, bestTune = NULL,
name = "naive"
)
} else {
# set up the tuning grid
if (method == "rf") {
param_text <- "mtry"
}
if (method == "svmLinear") {
if (autoTune) param <- remp_options(".default.svmLinear.tune")
grid <- expand.grid(C = param$C)
param_text <- "cost"
}
if (method == "svmRadial") {
if (autoTune) param <- remp_options(".default.svmRadial.tune")
grid <- expand.grid(sigma = param$sigma, C = param$C)
param_text <- c("sigma", "cost")
}
if (method == "xgbTree") {
if (autoTune) param <- remp_options(".default.xgbTree.tune")
grid <- expand.grid(
nrounds = param$nrounds, max_depth = param$max_depth, eta = param$eta,
gamma = param$gamma, colsample_bytree = param$colsample_bytree,
min_child_weight = param$min_child_weight, subsample = param$subsample
)
param_text <- c("nrounds", "max_depth", "eta", "gamma", "colsample_bytree", "min_child_weight", "subsample")
}
if (verbose & !autoTune) {
message(
" Pre-specified tuning parameter: ",
paste(paste(param_text, unlist(param), sep = " = "), collapse = ", ")
)
}
if (method == "rf") {
doQC <- TRUE
model.tune <- ranger::ranger(Methy ~ .,
data = d, mtry = param$mtry, num.threads = ncore,
importance = "permutation", keep.inbag = FALSE,
scale.permutation.importance = TRUE, seed = seed
)
var_imp <- ranger::importance(model.tune)
var_imp <- data.frame(
Predictor = names(var_imp),
Overall = var_imp
)
best_param <- model.tune$mtry
model <- model.tune
} else {
set.seed(seed)
if (ncore > 1) {
trC <- caret::trainControl(
method = trC.method, number = trC.number,
repeats = trC.repeats, allowParallel = TRUE
)
cluster <- parallel::makeCluster(ncore)
doParallel::registerDoParallel(cluster)
model.tune <- caret::train(Methy ~ .,
data = d, method = method,
tuneGrid = grid, trControl = trC, importance = TRUE
)
parallel::stopCluster(cluster)
registerDoSEQ()
} else {
trC <- caret::trainControl(
method = trC.method, number = trC.number,
repeats = trC.repeats, allowParallel = FALSE
)
model.tune <- caret::train(Methy ~ .,
data = d, method = method,
tuneGrid = grid, trControl = trC, importance = TRUE
)
}
var_imp <- caret::varImp(model.tune)
var_imp <- data.frame(
Predictor = rownames(var_imp$importance),
Overall = var_imp$importance
)
best_param <- model.tune$bestTune
model <- model.tune$finalModel
}
if (verbose & autoTune) {
message(
" Best tuning parameter found: ",
paste(paste(param_text, unlist(best_param), sep = " = "), collapse = ", ")
)
}
modelFit <- list(
model = model,
importance = var_imp,
bestTune = best_param,
name = method
)
}
return(modelFit)
}
# Prediction reliability based on Quantile random forest
.QTF <- function(rangerObj, x, y, newX, seed, num.threads) {
nodesX <- predict(rangerObj, x,
num.threads = num.threads,
type = "terminalNodes"
)$predictions
nnodes <- max(nodesX)
ntree <- ncol(nodesX)
n <- nrow(x)
valuesNodes <- matrix(nrow = nnodes, ncol = ntree)
set.seed(seed)
for (tree in seq_len(ntree)) {
shuffledNodes <- nodesX[rank(ind <- sample(seq_len(n), n)), tree]
useNodes <- sort(unique(as.numeric(shuffledNodes)))
valuesNodes[useNodes, tree] <- y[ind[base::match(useNodes, shuffledNodes)]]
}
predictNodes <- predict(rangerObj, newX, num.threads,
type = "terminalNodes"
)$predictions
valuesPredict <- 0 * predictNodes
for (tree in seq_len(ntree)) {
valuesPredict[, tree] <- valuesNodes[ predictNodes[, tree], tree]
}
result <- apply(valuesPredict, 1, sd)
return(result)
}
.predictREMP <- function(newdata, model) {
if (is.null(model)) {
data.frame(newdata)$Methy.min # newdata is matrix
} else {
if (any(class(model) %in% c("ksvm", "vm"))) {
kernlab::predict(model, newdata)
} # kernlab has its own predict function
else {
predict(model, newdata)
}
}
}
# Count the number of RE by gene type (NM/NR/Total), used by
# .coverageStats_RE()
.countREByGene <- function(RE_annotation, RE_list) {
Total <- length(RE_annotation)
RE_annotation_sub <- RE_annotation[runValue(RE_annotation$Index) %in% RE_list]
NM <- sum(!is.na(RE_annotation_sub$InNM))
NR <- sum(!is.na(RE_annotation_sub$InNR))
Gene <- sum(!is.na(RE_annotation_sub$InNM) | !is.na(RE_annotation_sub$InNR))
Intergenic <- sum(is.na(RE_annotation_sub$InNM) & is.na(RE_annotation_sub$InNR))
return(c(Total = Total, NM = NM, NR = NR, Gene = Gene, Intergenic = Intergenic))
}
.coverageStats_RE <- function(RE_annotation, regionCode, cpgRanges, RE_CpG_ILMN,
REtype, indent, verbose) {
mcols(RE_annotation) <- cbind(mcols(RE_annotation), regionCode)
mcols(cpgRanges) <- mcols(RE_annotation[base::match(as.character(cpgRanges$RE.Index), runValue(RE_annotation$Index))])
RE_annotation_prf <- subsetByOverlaps(RE_annotation, RE_CpG_ILMN, ignore.strand = TRUE)
cpgRanges_prf <- subsetByOverlaps(cpgRanges, RE_CpG_ILMN, ignore.strand = TRUE)
# Profiled RE for prediction
cvr_RE_win_list <- runValue(RE_annotation_prf$Index)
# Profiled RE for prediction + predicted unprofiled RE
cvr_unRE_win_list <- runValue(RE_annotation$Index)
REStats <- DataFrame(cbind(
.countREByGene(RE_annotation_prf, cvr_RE_win_list),
.countREByGene(cpgRanges_prf, cvr_RE_win_list),
.countREByGene(RE_annotation, cvr_unRE_win_list),
.countREByGene(cpgRanges, cvr_unRE_win_list)
))
colnames(REStats) <- c("Trained_RE", "Trained_RECpG", "Predict_RE", "Predict_RECpG")
if (verbose) {
.showREStats(REStats, REtype, "message", indent, TRUE)
}
return(REStats)
}
# Count the number of genes covered by RE, used by
# .coverageStats_GENE()
.countCoveredGene <- function(regionCode, refgene_main) {
InNM <- na.omit(regionCode$InNM)
InNR <- na.omit(regionCode$InNR)
NM_ind <- as.numeric(unique(unlist(strsplit(InNM, "[|]"))))
NR_ind <- as.numeric(unique(unlist(strsplit(InNR, "[|]"))))
gene_ind <- c(NM_ind, NR_ind)
NM <- length(unique(refgene_main$GeneSymbol[NM_ind]))
NR <- length(unique(refgene_main$GeneSymbol[NR_ind]))
Gene <- length(unique(refgene_main$GeneSymbol[gene_ind]))
return(c(NM = NM, NR = NR, Gene = Gene))
}
# indent = " "
.coverageStats_GENE <- function(regionCode, refgene_main,
REtype, indent, verbose) {
## Count the total number of gene (protein coding gene and noncoding RNA
## gene)
totgene <- length(unique(refgene_main$GeneSymbol))
totgene_NM <- length(unique(refgene_main[refgene_main$type == "NM", ]$GeneSymbol))
totgene_NR <- length(unique(refgene_main[refgene_main$type == "NR", ]$GeneSymbol))
GeneStats <- DataFrame(rbind(
c(totgene_NM, totgene_NR, totgene),
.countCoveredGene(regionCode, refgene_main)
))
rownames(GeneStats) <- c("Total", "Predicted RE")
if (verbose) {
.showGeneStats(GeneStats, REtype, "message", indent)
}
return(GeneStats)
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.