BSmooth: BSmooth, smoothing bisulfite sequence data

BSmooth R Documentation

BSmooth, smoothing bisulfite sequence data

Description

This implements the BSmooth algorithm for estimating methylation levels from bisulfite sequencing data.

Usage

BSmooth(BSseq,
        ns = 70,
        h = 1000,
        maxGap = 10^8,
        keep.se = FALSE,
        BPPARAM = bpparam(),
        chunkdim = NULL,
        level = NULL,
        verbose = getOption("verbose"))

Arguments

BSseq

An object of class BSseq.

ns

The minimum number of methylation loci in a smoothing window.

h

The minimum smoothing window, in bases.

maxGap

The maximum gap between two methylation loci, before the smoothing is broken across the gap. The default smooths each chromosome separately.

keep.se

Should the estimated standard errors from the smoothing algorithm be kept? This will make the returned object roughly 30 percent bigger, and the standard errors are currently not used for anything in bsseq.

BPPARAM

An optional BiocParallelParam instance determining the parallel back-end to be used during evaluation. Currently supported are SerialParam (Unix, Mac, Windows), MulticoreParam (Unix and Mac), SnowParam (Unix, Mac, and Windows, limited to single-machine clusters), and BatchtoolsParam (Unix, Mac, Windows, only with the in-memory realization backend). See sections 'Parallelization and progress monitoring' and 'Realization backends' for further details.

chunkdim

Only applicable if BACKEND == "HDF5Array". The dimensions of the chunks to use for writing the data to disk. By default, getHDF5DumpChunkDim() using the dimensions of the returned BSseq object will be used. See ?getHDF5DumpChunkDim for more information.

level

Only applicable if BACKEND == "HDF5Array". The compression level to use for writing the data to disk. By default, getHDF5DumpCompressionLevel() will be used. See ?getHDF5DumpCompressionLevel for more information.

verbose

A logical(1) indicating whether progress messages should be printed (default getOption("verbose")).
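The effect of maxGap described above can be illustrated in base R. The following is a minimal sketch, not the code bsseq uses internally: the helper name cluster_by_gap is hypothetical, and it assumes the positions are sorted and belong to a single chromosome. Loci that share a label would be smoothed together; a gap larger than max_gap starts a new group.

```r
# Hypothetical helper, not part of bsseq: label runs of loci whose
# neighbouring positions are separated by gaps no larger than max_gap.
cluster_by_gap <- function(pos, max_gap = 10^8) {
  if (length(pos) == 0L) return(integer(0))
  stopifnot(!is.unsorted(pos))
  # A new group starts at the first locus and after every gap > max_gap.
  cumsum(c(TRUE, diff(pos) > max_gap))
}

pos <- c(100, 150, 220, 5000, 5040, 20000)
cluster_by_gap(pos, max_gap = 1000)  # 1 1 1 2 2 3
cluster_by_gap(pos)                  # 1 1 1 1 1 1 (default gap never exceeded)
```

With the default of 10^8 no gap within a chromosome is large enough to split it, which is why the default amounts to smoothing each chromosome as a whole.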

Details

ns and h are passed to the locfit function. The bandwidth used at a given locus is the larger (in genomic distance) of h and the width needed to contain ns methylation loci.
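The bandwidth rule can be sketched in base R to show how ns and h interact. This is an illustration only, not how locfit computes its bandwidth: the helper name bandwidth_at is hypothetical, and it assumes the bandwidth is measured as a distance from the locus being smoothed (a half-width).

```r
# Hypothetical helper, not part of bsseq or locfit: approximate the
# bandwidth at locus i as the larger of h and the distance needed to
# reach the ns nearest loci (locus i itself included).
bandwidth_at <- function(pos, i, ns = 70, h = 1000) {
  stopifnot(ns <= length(pos), i >= 1, i <= length(pos))
  d <- sort(abs(pos - pos[i]))  # distances from locus i, ascending
  max(h, d[ns])                 # never narrower than h
}

dense  <- seq(1, by = 10,  length.out = 200)  # one locus every 10 bases
sparse <- seq(1, by = 500, length.out = 200)  # one locus every 500 bases

bandwidth_at(dense, 100)   # 1000: h dominates where loci are dense
bandwidth_at(sparse, 100)  # 17500: ns dominates where loci are sparse
```

In dense regions the window is set by h, while in sparse regions it widens until it holds ns loci.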

Value

An object of class BSseq, containing coefficients used to fit smoothed methylation values and optionally standard errors for these.

Realization backends

The BSmooth() function creates a new assay to store the coefficients used to construct the smoothed methylation estimates (coef). An additional assay is also created if keep.se == TRUE (se.coef).

The choice of realization backend controls whether these assay(s) are stored in-memory as an ordinary matrix or on-disk as an HDF5Array, for example.

The choice of realization backend is controlled by the BACKEND argument, which defaults to the current value of DelayedArray::getAutoRealizationBackend().

BSmooth supports the following realization backends:

  • NULL (in-memory): This stores each new assay in-memory using an ordinary matrix.

  • HDF5Array (on-disk): This stores each new assay on-disk in an HDF5 file using an HDF5Matrix from HDF5Array.

Please note that certain combinations of realization backend and parallelization backend are currently not supported. For example, the HDF5Array realization backend is currently only compatible when used with a single-machine parallelization backend (i.e. it is not compatible with a SnowParam that specifies an ad hoc cluster of multiple machines). BSmooth() will issue an error when given such incompatible realization and parallelization backends. Furthermore, to avoid memory usage blow-ups, BSmooth() will issue an error if an in-memory realization backend is used when smoothing a disk-backed BSseq object.

Additional arguments related to the realization backend can be passed via the ... argument. These arguments must be named and are passed to the relevant RealizationSink constructor. For example, the ... argument can be used to specify the path to the HDF5 file to be used by BSmooth(). Please see the examples at the bottom of the page.
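Before calling BSmooth() it can help to check which defaults would apply. The sketch below assumes the DelayedArray and HDF5Array packages are installed; the assay dimensions passed to getHDF5DumpChunkDim() are made up for illustration.

```r
library(DelayedArray)
library(HDF5Array)

# NULL means new assays would be realized in memory as ordinary matrices.
backend <- getAutoRealizationBackend()
backend

# Defaults that apply when level = NULL and chunkdim = NULL.
default_level <- getHDF5DumpCompressionLevel()
default_level

# Hypothetical assay dimensions: 1,000,000 loci by 10 samples.
default_chunkdim <- getHDF5DumpChunkDim(c(1000000L, 10L))
default_chunkdim
```

These values only describe the session defaults; explicit chunkdim and level arguments take precedence when supplied.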

Parallelization and progress monitoring

BSmooth() now uses the BiocParallel package to implement parallelization. This brings some notable improvements:

  • Smoothed results can now be written directly to an on-disk realization backend by the worker. This dramatically reduces memory usage compared to previous versions of bsseq that required all results be retained in-memory.

  • Parallelization is now supported on Windows through the use of a SnowParam object as the value of BPPARAM.

  • Detailed and extensive job logging facilities.

All parallelization options are controlled via the BPPARAM argument. In general, we recommend that users combine multicore (single-machine) parallelization with an on-disk realization backend (see section 'Realization backends'). For Unix and Mac users, this means using a MulticoreParam. For Windows users, this means using a single-machine SnowParam. Please consult the BiocParallel documentation to take full advantage of the more advanced features.
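The platform-specific recommendation above could be wrapped as follows. This is a minimal sketch assuming BiocParallel is installed; n_workers and param are illustrative names, and the worker count should be tuned to the machine.

```r
library(BiocParallel)

n_workers <- 2  # hypothetical worker count

# MulticoreParam relies on forking, which is unavailable on Windows,
# so use a single-machine SnowParam there instead.
param <- if (.Platform$OS.type == "windows") {
  SnowParam(workers = n_workers, progressbar = TRUE)
} else {
  MulticoreParam(workers = n_workers, progressbar = TRUE)
}

bpnworkers(param)
```

The resulting object can then be supplied as BPPARAM = param in the call to BSmooth().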

Deprecated arguments

parallelBy, mc.cores, and mc.preschedule are deprecated and will be removed in subsequent releases of bsseq. These arguments were necessary when BSmooth() used the parallel package to implement parallelization, but this functionality is superseded by the aforementioned use of BiocParallel. We recommend that users who previously relied on these arguments switch to BPPARAM = MulticoreParam(workers = mc.cores, progressbar = TRUE).

Progress monitoring

A useful feature of BiocParallel is its progress bars, which monitor the status of long-running jobs such as BSmooth(). Progress bars are controlled via the progressbar argument in the BiocParallelParam constructor. Progress bars replace the use of the deprecated verbose argument to print out information on the status of BSmooth().

BiocParallel also supports extensive and detailed logging facilities. Please consult the BiocParallel documentation to take full advantage of these advanced features.

Author(s)

Method and original implementation by Kasper Daniel Hansen khansen@jhsph.edu. Updated implementation to support disk-backed BSseq objects and more general parallelization by Peter Francis Hickey.

References

KD Hansen, B Langmead, and RA Irizarry. BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biology (2012) 13:R83. doi:10.1186/gb-2012-13-10-r83.

See Also

locfit in the locfit package, as well as BSseq.

Examples

## Not run: 
  # Run BSmooth() on a matrix-backed BSseq object using an in-memory realization
  # backend with serial evaluation.
  data(BS.chr22)
  # This is a matrix-backed BSseq object.
  sapply(assays(BS.chr22, withDimnames = FALSE), class)
  BS.fit <- BSmooth(BS.chr22, BPPARAM = SerialParam(progressbar = TRUE))
  # The new 'coef' assay is an ordinary matrix.
  sapply(assays(BS.fit, withDimnames = FALSE), class)
  BS.fit

  # Run BSmooth() on a disk-backed BSseq object using the HDF5Array realization
  # backend (with data written to the file 'BSmooth_example.h5') with
  # multi-core parallel evaluation.
  BS.chr22 <- realize(BS.chr22, "HDF5Array")
  # This is a disk-backed BSseq object.
  sapply(assays(BS.chr22, withDimnames = FALSE), class)
  BS.fit <- BSmooth(BS.chr22,
                    BPPARAM = MulticoreParam(workers = 2, progressbar = TRUE),
                    BACKEND = "HDF5Array",
                    filepath = "BSmooth_example.h5")
  # The new 'coef' assay is an HDF5Matrix.
  sapply(assays(BS.fit, withDimnames = FALSE), class)
  BS.fit
  # The new 'coef' assay is in the HDF5 file 'BSmooth_example.h5' (in the
  # current working directory).
  sapply(assays(BS.fit, withDimnames = FALSE), path)

## End(Not run)

kasperdanielhansen/bsseq documentation built on Jan. 18, 2025, 3:27 a.m.