R/generateMhlReport.R

#' generateMhlReport
#'
#' @description
#' This function computes \emph{Linearised} Methylated Haplotype Load
#' (\eqn{lMHL}) per genomic position.
#'
#' @details
#' The function reports \emph{Linearised} Methylated Haplotype Load
#' (\eqn{lMHL}) at the level of individual cytosines using BAM file location
#' or preprocessed data as an input. Function uses the following formula:
#' 
#' \deqn{lMHL=\frac{\sum_{i=1}^{L'} w_{i} \times MH_{i}}{\sum_{i=1}^{L'} w_{i} \times H_{i}}}
#' 
#' where \eqn{L'} is the length of a calculation window
#' (e.g., number of CpGs; \eqn{L' \le L},
#' where \eqn{L} is the length of a haplotype covering current genomic
#' position),
#' \eqn{MH_{i}} is a number of fully successive methylated stretches
#' with \eqn{i} loci within a methylated stretch that overlaps
#' current genomic position,
#' \eqn{H_{i}} is a number of fully successive stretches with \eqn{i} loci,
#' \eqn{w_{i}} is a weight for \eqn{i}-locus haplotype (\eqn{w_{i}=i}).
#' 
#' This formula is a modification of the original 
#' Methylated Haplotype Load (MHL) formula that was 
#' first described by Guo et al., 2017
#' (doi: \href{https://doi.org/10.1038/ng.3805}{10.1038/ng.3805}): 
#' 
#' \deqn{MHL=\frac{\sum_{i=1}^{L} w_{i} \times \frac{MH_{i}}{H_{i}}}{\sum_{i=1}^{L} w_{i}}}
#' 
#' where \eqn{L} is the length of a longest haplotype covering current genomic
#' position, \eqn{\frac{MH_{i}}{H_{i}}=P(MH_{i})} is the fraction
#' of fully successive methylated stretches with \eqn{i} loci, \eqn{w_{i}} is
#' a weight for \eqn{i}-locus haplotype (\eqn{w_{i}=i}).
#' 
#' The modifications to original formula are made in order to:
#' \itemize{
#'   \item \bold{provide granularity of values} --- the original MHL
#'   formula gives the same MHL value for every cytosine of a partially
#'   methylated haplotype (e.g., MHL=0.358 for each cytosine within a read with
#'   methylation call string "zZZZ"). In contrast, \eqn{lMHL}==0 for the
#'   non-methylated cytosines (e.g., \eqn{lMHL}==c(0, 0.5, 0.5, 0.5) for
#'   cytosines within a read with methylation call string "zZZZ").
#'   \item \bold{enable calculations for long-read sequencing alignments} ---
#'   \eqn{lMHL}
#'   calculation window can be limited to a particular number of cytosines.
#'   This allows to use the formula for very long haplotypes as well as
#'   to compare values for sequencing data of varying read length.
#'   \item \bold{reduce the complexity of MHL calculation} for data of high
#'   breadth and depth --- \eqn{lMHL} values for all genomic positions can be
#'   calculated using a single pass (cycling through reads just once) as
#'   the linearised calculations of numerator and denominator for
#'   \eqn{lMHL} do not require prior knowledge on how many reads cover
#'   a particular position. This is achieved by moving \eqn{H_{i}} multiplier
#'   to the denominator of the \eqn{lMHL} formula.
#' }
#' 
#' These modifications make \eqn{lMHL} calculation similar though
#' \emph{non-equivalent} to the original MHL.
#' However, the most important property of MHL --- emphasis on
#' hypermethylated blocks --- is retained. And in return, \eqn{lMHL} gets better
#' applicability for analysis of sequencing data of varying depth and read
#' length.
#' 
#' Other notes on function's behaviour:
#' 
#' Methylation string bases in unknown context ("uU") are simply ignored, which,
#' to the best of our knowledge, is consistent with the behaviour of other
#' tools.
#' 
#' Cytosine context present in more than 50\% of the reads is assumed
#' to be correct, while all bases at the same position but having other
#' methylation context are simply ignored. This allows reports to be prepared
#' without using the reference genome sequence.
#'
#' @param bam BAM file location string OR preprocessed output of
#' \code{\link[epialleleR]{preprocessBam}} function. Read more about BAM file
#' requirements and BAM preprocessing at \code{\link{preprocessBam}}.
#' @param report.file file location string to write the \eqn{lMHL} report.
#' If NULL (the default) then report is returned as a
#' \code{\link[data.table]{data.table}} object.
#' @param haplotype.context string for a cytosine context that defines
#' a haplotype:
#' \itemize{
#'   \item "CG" (the default) -- CpG cytosines only (called as zZ)
#'   \item "CHG" -- CHG cytosines only (xX)
#'   \item "CHH" -- CHH cytosines only (hH)
#'   \item "CxG" -- CG and CHG cytosines (zZxX)
#'   \item "CX" -- all cytosines; this, as well as the other non-CG contexts,
#'   may have little sense but still included for consistency
#' }
#' If \eqn{lMHL} calculations are needed for all three possible cytosine
#' contexts \emph{independently}, one has to run this function for
#' each required `haplotype.context` separately, because
#' `haplotype.context`=="CX" assumes that \emph{any} cytosine context
#' is allowed within the same haplotype. This behaviour may change in the
#' future.
#' @param max.haplotype.window non-negative integer for maximum value of
#' \eqn{L'} in \eqn{lMHL} formula.
#' When 0 (the default), calculations are performed for the full haplotype
#' length (\eqn{L'=L}, although the maximum value is currently limited to
#' 65535). Having no length restrictions make sense for short-read sequencing
#' when the
#' length of the read is comparable to the length of a typical methylated block,
#' the depth of coverage is high, and the lengths of all reads are roughly
#' equal.
#' However, calculations using non-restricted haplotype length are meaningless
#' for long-read sequencing --- when the same read may
#' cover a number of regions with very different methylation properties, and
#' reads themselves can be of a very different length. In the latter case it is
#' advised to limit the `max.haplotype.window` to a number of cytosines in a
#' typical hypermethylated region. For thorough
#' explanation and more examples, see Details section and vignette.
#' @param min.haplotype.length non-negative integer for minimum length of a
#' haplotype (default: 0 will include haplotypes of any length).
#' When `min.haplotype.length`>0, reads
#' (read pairs) with fewer than `min.haplotype.length` cytosines
#' within the `haplotype.context` are skipped.
#' @param max.outofcontext.beta real number in the range [0;1] (default: 0.1).
#' Reads (read pairs) with average beta value for out-of-context cytosines
#' \strong{above} this threshold are skipped. Set to 1 to disable filtering.
#' @param ... other parameters to pass to the
#' \code{\link[epialleleR]{preprocessBam}} function.
#' Options have no effect if preprocessed BAM data was supplied as an input.
#' @param gzip boolean to compress the report (default: FALSE).
#' @param verbose boolean to report progress and timings (default: TRUE).
#' @return \code{\link[data.table]{data.table}} object containing \eqn{lMHL}
#' report or NULL if report.file was specified. The
#' report columns are:
#' \itemize{
#'   \item rname -- reference sequence name (as in BAM)
#'   \item strand -- strand
#'   \item pos -- cytosine position
#'   \item context -- methylation context
#'   \item coverage -- number of reads (read pairs) that include this position
#'   \item length -- average length of a haplotype, i.e.,
#'   average number of cytosines within `haplotype.context` for
#'   reads (read pairs) that include this position
#'   \item lmhl -- \eqn{lMHL} value
#' }
#' @seealso `values` vignette for a comparison and visualisation of epialleleR
#' output values for various input files. `epialleleR` vignette for the
#' description of usage and sample data.
#' 
#' \code{\link{preprocessBam}} for preloading BAM data,
#' \code{\link{generateCytosineReport}} for other methylation statistics at the
#' level of individual cytosines,
#' \code{\link{generateBedReport}} for genomic region-based statistics,
#' \code{\link{generateVcfReport}} for evaluating epiallele-SNV associations,
#' \code{\link{extractPatterns}} for exploring methylation patterns
#' and \code{\link{plotPatterns}} for pretty plotting of its output,
#' \code{\link{generateBedEcdf}} for analysing the distribution of per-read
#' beta values.
#' @examples
#'   capture.bam <- system.file("extdata", "capture.bam", package="epialleleR")
#'   
#'   # lMHL report
#'   mhl.report <- generateMhlReport(capture.bam)
#'   
#'   # lMHL report with a `max.haplotype.window` of 1 is identical to a
#'   # conventional cytosine report (or nearly identical when sequencing errors
#'   # are present)
#'   mhl.report <- generateMhlReport(capture.bam, max.haplotype.window=1)
#'   cg.report  <- generateCytosineReport(capture.bam, threshold.reads=FALSE)
#'   identical(
#'     mhl.report[, .(rname, strand, pos, context, value=lmhl)],
#'     cg.report[ , .(rname, strand, pos, context, value=meth/(meth+unmeth))]
#'   )
#' @export
generateMhlReport <- function (bam,
                               report.file=NULL,
                               haplotype.context=c("CG", "CHG", "CHH", "CxG", "CX"),
                               max.haplotype.window=0,
                               min.haplotype.length=0,
                               max.outofcontext.beta=0.1,
                               ...,
                               gzip=FALSE,
                               verbose=TRUE)
{
  haplotype.context <- match.arg(haplotype.context, haplotype.context)
  
  bam <- preprocessBam(bam.file=bam, ..., verbose=verbose)
  
  mhl.report <- .getMhlReport(
    bam.processed=bam, 
    ctx=paste(.context.to.bases[[haplotype.context]][c("ctx.meth", "ctx.unmeth")], collapse=""),
    max.window=max.haplotype.window, min.length=min.haplotype.length,
    max.ooctx.beta=max.outofcontext.beta,
    verbose=verbose
  )
  
  if (is.null(report.file))
    return(mhl.report)
  else
    .writeReport(report=mhl.report, report.file=report.file, gzip=gzip,
                 verbose=verbose)
}
BBCG/epialleleR documentation built on Nov. 19, 2024, 3:59 p.m.