dmrseq: Main function for detecting and evaluating significance of...

View source: R/dmrseq.R

dmrseq R Documentation

Main function for detecting and evaluating significance of DMRs.

Description

Performs a two-step approach that (1) detects candidate regions, and (2) scores candidate regions with an exchangeable (across the genome) statistic and evaluates statistical significance using a permutation test on the pooled null distribution of scores.

Usage

dmrseq(
  bs,
  testCovariate,
  adjustCovariate = NULL,
  cutoff = 0.1,
  minNumRegion = 5,
  smooth = TRUE,
  bpSpan = 1000,
  minInSpan = 30,
  maxGapSmooth = 2500,
  maxGap = 1000,
  verbose = TRUE,
  maxPerms = 10,
  matchCovariate = NULL,
  BPPARAM = bpparam(),
  stat = "stat",
  block = FALSE,
  blockSize = 5000,
  chrsPerChunk = 1
)

Arguments

bs

bsseq object containing the methylation values as well as the phenotype matrix that contains sample level covariates

testCovariate

Character value indicating which variable (column name) in pData(bs) to test for association with methylation levels. Can alternatively specify an integer value indicating which column of pData(bs) to use. This is used to construct the design matrix for the test statistic calculation. To run using a continuous covariate or a categorical covariate with more than two groups, simply pass in the name of a column in pData(bs) that contains this covariate. A continuous covariate is assumed if the data type of the testCovariate column is continuous, unless it has only two unique values (in which case a two-group comparison is carried out).

adjustCovariate

an (optional) character value or vector indicating which variables (column names) in pData(bs) will be adjusted for when testing for the association of methylation value with the testCovariate. Can alternatively specify an integer value or vector indicating which of the columns of pData(bs) to adjust for. The default is NULL; if a value is supplied, it is also used to construct the design matrix for the test statistic calculation.

cutoff

scalar value that represents the absolute value (or a vector of two numbers representing a lower and upper bound) for the cutoff of the single CpG coefficient that is used to discover candidate regions. Default value is 0.10.
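The two forms of cutoff can be pictured with a small base R sketch. The helper flag_candidates below is hypothetical and is not part of dmrseq; it assumes that a scalar cutoff is treated as a symmetric pair of bounds and that a CpG is flagged when its coefficient lies at or beyond either bound. The actual candidate detection in dmrseq may differ in detail.

```r
# Hypothetical helper (not a dmrseq function): flag CpGs whose
# coefficient falls outside the cutoff bounds.
flag_candidates <- function(coef, cutoff) {
  # Assumption: a scalar cutoff means the symmetric bounds (-cutoff, cutoff)
  if (length(cutoff) == 1) {
    cutoff <- c(-abs(cutoff), abs(cutoff))
  }
  stopifnot(length(cutoff) == 2)
  bounds <- sort(cutoff)
  coef <= bounds[1] | coef >= bounds[2]
}

# Made-up per-CpG coefficients, for illustration only
coef <- c(-0.25, -0.04, 0.02, 0.12, 0.30)

symmetric  <- flag_candidates(coef, 0.10)            # scalar form
asymmetric <- flag_candidates(coef, c(-0.30, 0.10))  # lower and upper bound
```

With the asymmetric bounds, the first coefficient (-0.25) is no longer flagged because it lies above the lower bound of -0.30.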

minNumRegion

positive integer that represents the minimum number of CpGs to consider for a candidate region. Default value is 5. Minimum value is 3.

smooth

logical value that indicates whether or not to smooth the CpG level signal when discovering candidate regions. Defaults to TRUE.

bpSpan

a positive integer that represents the length in basepairs of the smoothing span window if smooth is TRUE. Default value is 1000.

minInSpan

positive integer that represents the minimum number of CpGs in a smoothing span window if smooth is TRUE. Default value is 30.

maxGapSmooth

integer value representing the maximum number of basepairs between neighboring CpGs to be included in the same cluster when performing smoothing (should generally be larger than maxGap). Default value is 2500.

maxGap

integer value representing the maximum number of basepairs between neighboring CpGs to be included in the same DMR. Default value is 1000.

verbose

logical value that indicates whether progress messages should be printed to stdout. Default value is TRUE.

maxPerms

a positive integer that represents the maximum number of permutations that will be used to generate the global null distribution of test statistics. Default value is 10.

matchCovariate

An (optional) character value indicating which variable (column name) of pData(bs) will be blocked for when constructing the permutations in order to test for the association of methylation value with the testCovariate. Only to be used when testCovariate is a two-group factor and the number of possible permutations is less than 500000. Alternatively, you can specify an integer value indicating which column of pData(bs) to block for. Blocking means that only permutations with balanced composition of testCovariate values will be used. For example, if you have samples from different genders and gender is not your covariate of interest, it is recommended to use gender as a matching covariate to avoid one of the permutations testing entirely males versus females; such a permutation violates the null hypothesis and will decrease power. If NULL (default), no blocking is performed.
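The effect of blocking on the set of usable permutations can be sketched in base R. This is a toy illustration under stated assumptions, not the dmrseq implementation: the design (pheno) is made up, and a relabelling is counted as balanced here when the relabelled first group has the same composition of the matching covariate as the original first group.

```r
# Made-up design: two groups of four samples, two of each sex per group
pheno <- data.frame(
  group = rep(c("A", "B"), each = 4),
  sex   = rep(c("F", "F", "M", "M"), times = 2)
)

# Every way of choosing which 4 of the 8 samples are labelled "A"
relabel <- combn(nrow(pheno), 4)

# Sex composition of group "A" in the original design
sex_levels <- sort(unique(pheno$sex))
orig <- as.vector(table(factor(pheno$sex[pheno$group == "A"],
                               levels = sex_levels)))

# Assumption: keep only relabellings that reproduce that composition
balanced <- apply(relabel, 2, function(idx) {
  comp <- as.vector(table(factor(pheno$sex[idx], levels = sex_levels)))
  identical(comp, orig)
})

n_all      <- ncol(relabel)   # choose(8, 4) = 70
n_balanced <- sum(balanced)   # choose(4, 2) * choose(4, 2) = 36
```

In this toy design, relabellings that put all four females in one group are among the 34 that are excluded.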

BPPARAM

a BiocParallelParam object to specify the parallel backend. The default option is BiocParallel::bpparam(), which automatically creates a cluster appropriate for the operating system.
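As a minimal sketch of supplying an explicit backend, the call below passes a BiocParallel::SnowParam object instead of relying on the default. The worker count of 2 is an arbitrary choice for illustration, and the subset of BS.chr21 mirrors the one used in the Examples section.

```r
library(dmrseq)
library(BiocParallel)

data(BS.chr21)

# SnowParam is available on all platforms; MulticoreParam is a
# forking alternative on Unix-alikes. Two workers is illustrative.
param <- SnowParam(workers = 2)

regions <- dmrseq(bs = BS.chr21[240001:250000, ],
                  cutoff = 0.05,
                  testCovariate = "CellType",
                  BPPARAM = param)
```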

stat

a character vector indicating the name of the column of the output to use as the region-level test statistic. Default value is 'stat', which is the region-level statistic designed to be comparable across the genome. It is not recommended to change this argument, but it can be done for experimental purposes. Possible values are: 'L' - the number of loci in the region, 'area' - the sum of the smoothed loci statistics, 'beta' - the effect size of the region, 'stat' - the test statistic for the region, or 'avg' - the average smoothed loci statistic.

block

logical indicating whether to search for large-scale (low resolution) blocks of differential methylation (default is FALSE, which means that local DMRs are desired). If TRUE, the parameters for bpSpan, minInSpan, and maxGapSmooth should be adjusted (increased) accordingly. This setting will also merge candidate regions that (1) are in the same direction and (2) are less than 1kb apart with no covered CpGs separating them. The region-level model used is also slightly modified: instead of a loci-specific intercept for each CpG in the region, the intercept term is modeled as a natural spline with one interior knot per 10kb of length (up to 10 interior knots).
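A block-level search might look like the following sketch. The increased values for minInSpan, bpSpan, maxGapSmooth, and maxGap, as well as the chosen subset of BS.chr21, are illustrative assumptions rather than recommended settings; suitable values depend on CpG coverage and the scale of the blocks of interest.

```r
library(dmrseq)

data(BS.chr21)

# Search for large-scale blocks; the smoothing parameters are
# increased relative to their defaults (illustrative values only).
blocks <- dmrseq(bs = BS.chr21[120001:125000, ],
                 cutoff = 0.05,
                 testCovariate = "CellType",
                 block = TRUE,
                 minInSpan = 500,
                 bpSpan = 5e4,
                 maxGapSmooth = 1e6,
                 maxGap = 5e3)
```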

blockSize

numeric value indicating the minimum number of basepairs to be considered a block (only used if block=TRUE). Default is 5000 basepairs.

chrsPerChunk

a positive integer value indicating the number of chromosomes per chunk. The default is 1, meaning that the data will be looped through one chromosome at a time. When pairing up multiple chromosomes per chunk, sizes (in terms of numbers of CpGs) will be taken into consideration to balance the sizes of each chunk.

Value

a GRanges object that contains the results of the inference. The object contains one row for each candidate region, sorted by q-value and then chromosome. The standard GRanges chr, start, and end are included, along with at least 7 metadata columns, in the following order:

1. L = the number of CpGs contained in the region,
2. area = the sum of the smoothed beta values,
3. beta = the coefficient value for the condition difference (there will be more than one column here if a multi-group comparison was performed),
4. stat = the test statistic for the condition difference,
5. pval = the permutation p-value for the significance of the test statistic,
6. qval = the q-value for the test statistic (adjustment for multiple comparisons to control false discovery rate), and
7. index = an IRanges containing the indices of the region's first CpG to last CpG.
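The returned object can be inspected with ordinary GRanges operations. The sketch below assumes a two-group comparison on the same BS.chr21 subset used in the Examples section; the 0.05 q-value threshold is an arbitrary illustration, not a recommendation.

```r
library(dmrseq)

data(BS.chr21)

regions <- dmrseq(bs = BS.chr21[240001:250000, ],
                  cutoff = 0.05,
                  testCovariate = "CellType")

# Metadata columns are accessible with `$`
head(regions$qval)

# Keep regions below an illustrative q-value threshold;
# which() guards against any missing q-values
sigRegions <- regions[which(regions$qval < 0.05)]
length(sigRegions)
```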

Examples


# load the package and the example data
library(dmrseq)
data(BS.chr21)

# the covariate of interest is the 'CellType' column of pData(BS.chr21)
testCovariate <- 'CellType'

# run dmrseq on a subset of the chromosome (10K CpGs)
regions <- dmrseq(bs = BS.chr21[240001:250000, ],
                  cutoff = 0.05,
                  testCovariate = testCovariate)


kdkorthauer/dmrseq documentation built on Sept. 26, 2024, 9:32 p.m.