#' generateMhlReport
#'
#' @description
#' This function computes \emph{Linearised} Methylated Haplotype Load
#' (\eqn{lMHL}) per genomic position.
#'
#' @details
#' The function reports \emph{Linearised} Methylated Haplotype Load
#' (\eqn{lMHL}) at the level of individual cytosines using BAM file location
#' or preprocessed data as an input. Function uses the following formula:
#'
#' \deqn{lMHL=\frac{\sum_{i=1}^{L'} w_{i} \times MH_{i}}{\sum_{i=1}^{L'} w_{i} \times H_{i}}}
#'
#' where \eqn{L'} is the length of a calculation window
#' (e.g., number of CpGs; \eqn{L' \le L},
#' where \eqn{L} is the length of a haplotype covering current genomic
#' position),
#' \eqn{MH_{i}} is a number of fully successive methylated stretches
#' with \eqn{i} loci within a methylated stretch that overlaps
#' current genomic position,
#' \eqn{H_{i}} is a number of fully successive stretches with \eqn{i} loci,
#' \eqn{w_{i}} is a weight for \eqn{i}-locus haplotype (\eqn{w_{i}=i}).
#'
#' This formula is a modification of the original
#' Methylated Haplotype Load (MHL) formula that was
#' first described by Guo et al., 2017
#' (doi: \href{https://doi.org/10.1038/ng.3805}{10.1038/ng.3805}):
#'
#' \deqn{MHL=\frac{\sum_{i=1}^{L} w_{i} \times \frac{MH_{i}}{H_{i}}}{\sum_{i=1}^{L} w_{i}}}
#'
#' where \eqn{L} is the length of a longest haplotype covering current genomic
#' position, \eqn{\frac{MH_{i}}{H_{i}}=P(MH_{i})} is the fraction
#' of fully successive methylated stretches with \eqn{i} loci, \eqn{w_{i}} is
#' a weight for \eqn{i}-locus haplotype (\eqn{w_{i}=i}).
#'
#' The modifications to original formula are made in order to:
#' \itemize{
#' \item \bold{provide granularity of values} --- the original MHL
#' formula gives the same MHL value for every cytosine of a partially
#' methylated haplotype (e.g., MHL=0.358 for each cytosine within a read with
#' methylation call string "zZZZ"). In contrast, \eqn{lMHL}==0 for the
#' non-methylated cytosines (e.g., \eqn{lMHL}==c(0, 0.5, 0.5, 0.5) for
#' cytosines within a read with methylation call string "zZZZ").
#' \item \bold{enable calculations for long-read sequencing alignments} ---
#' \eqn{lMHL}
#' calculation window can be limited to a particular number of cytosines.
#' This allows to use the formula for very long haplotypes as well as
#' to compare values for sequencing data of varying read length.
#' \item \bold{reduce the complexity of MHL calculation} for data of high
#' breadth and depth --- \eqn{lMHL} values for all genomic positions can be
#' calculated using a single pass (cycling through reads just once) as
#' the linearised calculations of numerator and denominator for
#' \eqn{lMHL} do not require prior knowledge on how many reads cover
#' a particular position. This is achieved by moving \eqn{H_{i}} multiplier
#' to the denominator of the \eqn{lMHL} formula.
#' }
#'
#' These modifications make \eqn{lMHL} calculation similar though
#' \emph{non-equivalent} to the original MHL.
#' However, the most important property of MHL --- emphasis on
#' hypermethylated blocks --- is retained. And in return, \eqn{lMHL} gets better
#' applicability for analysis of sequencing data of varying depth and read
#' length.
#'
#' Other notes on function's behaviour:
#'
#' Methylation string bases in unknown context ("uU") are simply ignored, which,
#' to the best of our knowledge, is consistent with the behaviour of other
#' tools.
#'
#' Cytosine context present in more than 50\% of the reads is assumed
#' to be correct, while all bases at the same position but having other
#' methylation context are simply ignored. This allows reports to be prepared
#' without using the reference genome sequence.
#'
#' @param bam BAM file location string OR preprocessed output of
#' \code{\link[epialleleR]{preprocessBam}} function. Read more about BAM file
#' requirements and BAM preprocessing at \code{\link{preprocessBam}}.
#' @param report.file file location string to write the \eqn{lMHL} report.
#' If NULL (the default) then report is returned as a
#' \code{\link[data.table]{data.table}} object.
#' @param haplotype.context string for a cytosine context that defines
#' a haplotype:
#' \itemize{
#' \item "CG" (the default) -- CpG cytosines only (called as zZ)
#' \item "CHG" -- CHG cytosines only (xX)
#' \item "CHH" -- CHH cytosines only (hH)
#' \item "CxG" -- CG and CHG cytosines (zZxX)
#' \item "CX" -- all cytosines; this, as well as the other non-CG contexts,
#' may have little sense but still included for consistency
#' }
#' If \eqn{lMHL} calculations are needed for all three possible cytosine
#' contexts \emph{independently}, one has to run this function for
#' each required `haplotype.context` separately, because
#' `haplotype.context`=="CX" assumes that \emph{any} cytosine context
#' is allowed within the same haplotype. This behaviour may change in the
#' future.
#' @param max.haplotype.window non-negative integer for maximum value of
#' \eqn{L'} in \eqn{lMHL} formula.
#' When 0 (the default), calculations are performed for the full haplotype
#' length (\eqn{L'=L}, although the maximum value is currently limited to
#' 65535). Having no length restrictions make sense for short-read sequencing
#' when the
#' length of the read is comparable to the length of a typical methylated block,
#' the depth of coverage is high, and the lengths of all reads are roughly
#' equal.
#' However, calculations using non-restricted haplotype length are meaningless
#' for long-read sequencing --- when the same read may
#' cover a number of regions with very different methylation properties, and
#' reads themselves can be of a very different length. In the latter case it is
#' advised to limit the `max.haplotype.window` to a number of cytosines in a
#' typical hypermethylated region. For thorough
#' explanation and more examples, see Details section and vignette.
#' @param min.haplotype.length non-negative integer for minimum length of a
#' haplotype (default: 0 will include haplotypes of any length).
#' When `min.haplotype.length`>0, reads
#' (read pairs) with fewer than `min.haplotype.length` cytosines
#' within the `haplotype.context` are skipped.
#' @param max.outofcontext.beta real number in the range [0;1] (default: 0.1).
#' Reads (read pairs) with average beta value for out-of-context cytosines
#' \strong{above} this threshold are skipped. Set to 1 to disable filtering.
#' @param ... other parameters to pass to the
#' \code{\link[epialleleR]{preprocessBam}} function.
#' Options have no effect if preprocessed BAM data was supplied as an input.
#' @param gzip boolean to compress the report (default: FALSE).
#' @param verbose boolean to report progress and timings (default: TRUE).
#' @return \code{\link[data.table]{data.table}} object containing \eqn{lMHL}
#' report or NULL if report.file was specified. The
#' report columns are:
#' \itemize{
#' \item rname -- reference sequence name (as in BAM)
#' \item strand -- strand
#' \item pos -- cytosine position
#' \item context -- methylation context
#' \item coverage -- number of reads (read pairs) that include this position
#' \item length -- average length of a haplotype, i.e.,
#' average number of cytosines within `haplotype.context` for
#' reads (read pairs) that include this position
#' \item lmhl -- \eqn{lMHL} value
#' }
#' @seealso `values` vignette for a comparison and visualisation of epialleleR
#' output values for various input files. `epialleleR` vignette for the
#' description of usage and sample data.
#'
#' \code{\link{preprocessBam}} for preloading BAM data,
#' \code{\link{generateCytosineReport}} for other methylation statistics at the
#' level of individual cytosines,
#' \code{\link{generateBedReport}} for genomic region-based statistics,
#' \code{\link{generateVcfReport}} for evaluating epiallele-SNV associations,
#' \code{\link{extractPatterns}} for exploring methylation patterns
#' and \code{\link{plotPatterns}} for pretty plotting of its output,
#' \code{\link{generateBedEcdf}} for analysing the distribution of per-read
#' beta values.
#' @examples
#' capture.bam <- system.file("extdata", "capture.bam", package="epialleleR")
#'
#' # lMHL report
#' mhl.report <- generateMhlReport(capture.bam)
#'
#' # lMHL report with a `max.haplotype.window` of 1 is identical to a
#' # conventional cytosine report (or nearly identical when sequencing errors
#' # are present)
#' mhl.report <- generateMhlReport(capture.bam, max.haplotype.window=1)
#' cg.report <- generateCytosineReport(capture.bam, threshold.reads=FALSE)
#' identical(
#' mhl.report[, .(rname, strand, pos, context, value=lmhl)],
#' cg.report[ , .(rname, strand, pos, context, value=meth/(meth+unmeth))]
#' )
#' @export
generateMhlReport <- function (bam,
report.file=NULL,
haplotype.context=c("CG", "CHG", "CHH", "CxG", "CX"),
max.haplotype.window=0,
min.haplotype.length=0,
max.outofcontext.beta=0.1,
...,
gzip=FALSE,
verbose=TRUE)
{
haplotype.context <- match.arg(haplotype.context, haplotype.context)
bam <- preprocessBam(bam.file=bam, ..., verbose=verbose)
mhl.report <- .getMhlReport(
bam.processed=bam,
ctx=paste(.context.to.bases[[haplotype.context]][c("ctx.meth", "ctx.unmeth")], collapse=""),
max.window=max.haplotype.window, min.length=min.haplotype.length,
max.ooctx.beta=max.outofcontext.beta,
verbose=verbose
)
if (is.null(report.file))
return(mhl.report)
else
.writeReport(report=mhl.report, report.file=report.file, gzip=gzip,
verbose=verbose)
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.