In zwdzwd/sesame: SEnsible Step-wise Analysis of DNA MEthylation BeadChips

options(rmarkdown.html_vignette.check_title = FALSE)

Preparation

library(sesame)

## As sesame and sesameData are under active development, this documentation is
## specific to the following version of R, sesame, sesameData and ExperimentHub:
sesame_checkVersion()

We recommend updating your R, ExperimentHub, sesameData and sesame to use this documentation consistently. If you have installed directly from github, please make sure the compatible ExperimentHub is installed.

If you use a previous version, please checkout the vignette that corresponds to the right version here https://zhou-lab.github.io/sesame/dev/supplemental.html#Versions

CRITICAL: After a new installation, one must cache the associated annotation data using the following command. This needs to be done only once per SeSAMe installation/update. Caching data to local guarantees proper data retrieval and saves internet traffic.

sesameDataCache()

This function caches the needed SeSAMe annotations. SeSAMe annotation data is managed by the sesameData package which uses the ExperimentHub infrastructure. You can find the location of the cached annotation data on your local computer using:

tools::R_user_dir("ExperimentHub", which="cache")

The openSesame Pipeline

The openSesame function provides end-to-end processing that converts IDATs to DNA methylation level (aka $\beta$ value) matrices in R. The function input path_to_idats can be one of the following input:

A path to a directory where the IDAT files will be recursively search
Specific path(s) of IDAT file prefix, one for each IDAT pair
One or a list of SigDF objects

The following code uses a directory that contains built-in two HM27 IDAT pairs to demonstrates the use of openSesame:

betas = openSesame("path_to_idats", BPPARAM = BiocParallel::MulticoreParam(2))

The BPPARAM= option is from the BiocParallel package and controls parallel processing (in this case, we are using two cores). Under the hood, the function performs a series of tasks including: searching IDAT files from the directory (the searchIDATprefixes function), reading IDAT data in as SigDF objects (the readIDATpair function), preprocessing the signals (the prepSesame function), and finally converting them to DNA methylation levels ($\beta$ values, the getBetas function). Alternatively, one can run the following command to get the same results, while gaining more refined control:

##  The above openSesame call is equivalent to:
betas = do.call(cbind, BiocParallel::bplapply(
    searchIDATprefixes(idat_dir), function(pfx) {
        getBetas(prepSesame(readIDATpair(pfx), "QCDPB"))
}, BPPARAM = BiocParallel::MulticoreParam(2)))

## or even more explicitly (if one needs to control argument passed
## to a specific preprocessing function)
betas = do.call(cbind, BiocParallel::bplapply(
    searchIDATprefixes(idat_dir), function(pfx) {
        getBetas(noob(pOOBAH(dyeBiasNL(inferInfiniumIChannel(qualityMask(
            readIDATpair(pfx)))))))
}, BPPARAM = BiocParallel::MulticoreParam(2)))

The openSesame function is highly customizable. The prep= argument is the same argument one gives to the prepSesame function (see [Data Preprocessing] for detail) which openSesame calls internally. The argument uniquely specifies a preprocessing procedure. The func= option specifies the signal extraction function. It can be either be getBetas (DNA methylation) or getAFs (allele frequencies of SNP probes) or NULL (returns SigDF). The manifest= option allows one to provide an array manifest when handling data from platform not supported natively. Finally, the BPPARAM= argument is the same argument taken by BiocParallel::bplapply to allow parallel processing. See Supplemental Vignette for details of these component functions of openSesame.

The output of openSesame can also be customized. It can either be beta values, which are the end DNA methylation readings, as shown above. It can also be a list of SigDFs which stores the signal intensities and can be further put back to openSesame for more processing. The openSesame(func=) argument specifies whether the output is a SigDF list or beta values. The following shows some usage:

betas = openSesame(idat_dir, func = getBetas) # getBetas is the default
sdfs = openSesame(idat_dir, func = NULL) # return SigDF list
allele_freqs = openSesame(idat_dir, func = getAFs) # SNP allele frequencies
sdfs = openSesame(sdfs, prep = "Q", func = NULL)   # take and return SigDFs

One can also generate the detection p-values (e.g., for GEO upload) by feeding openSesame with func = pOOBAH (or other detection p-value calculators).

pvals = openSesame(idat_dir, func = pOOBAH, return.pval=TRUE)

Data Preprocessing {#dataprep}

The prep= argument instructs the openSesame function to call the prepSesame function to preprocess signal intensity under the hood. This can be skipped by using prep="". The prepSesame function takes a single SigDF as input and returns a processed SigDF. When prep= is non-empty, it selects the preprocessing functions (see Preprocessing Function Code) and specifies the order of their execution. For example,

sdf = sesameDataGet('EPIC.1.SigDF')
sdf_preped = openSesame(sdf, prep="DB", func=NULL)

performs dye bias correction (D) followed by background subtraction (B). In other words, prepSesame(sdf, "DB") is equivalent to noob(dyeBiasNL(sdf)). All the preprocessing functions take a SigDF as input and return an updated SigDF. Therefore, these functions can be chained together. The choice of preprocessing functions and the order of their chaining is important (see Supplemental Vignette) for detailed discussions of these functions). The following table lists the best preprocessing strategy based on our experience.

library(knitr)
df <- data.frame(rbind(
    c("EPICv2/EPIC/HM450", "human", "QCDPB"),
    c("EPICv2/EPIC/HM450", "non-human organism", "SQCDPB"),
    c("MM285", "mouse", "TQCDPB"),
    c("MM285", "non-mouse organism", "SQCDPB"),
    c("Mammal40", "human", "HCDPB"),
    c("Mammal40", "non-human organism", "SHCDPB")))
colnames(df) <- c("Platform", "Sample Organism", "Prep Code")
kable(df, caption="Recommended Preprocessing")

The optimal strategy of preprocessing depends on:

1) The array platform. For example, certain array platforms (e.g., the Mammal40) do not have enough Infinium-I probes for background estimation and dye bias correction, therefore background subtraction (where the out-of-band signals are from) might not work most optimally;

2) The expected sample property. For example, some samples have the signature bimodal distribution of methylation of most mammalian cells. Others may undergo global loss of methylation (germ cells, tumors etc). Other important factors include high-input vs low-input, tumor vs normal, somatic vs germ cells, human vs model organisms, mouse strains etc. Some platforms (e.g., Mammal40 and MM285) are designed for multiple species and strains. Therefore S and T would be important when those arrays are used on non-reference organisms (see Working with Nonhuman Arrays).

Preprocessing Function Code {#prep}

The prepSesameList function lists all the available codes and the associated preprocessing functions.

prepSesameList()

Here are some consideration when determining the preprocessing order. Species (S) and strain (T) inference resets the mask and color channels based on probe alignment and presence of genetic variants. Therefore when they are used, they need to be called first. Q masks non-uniquely mapped probes which may inflate the out-of-band signal for background estimation. Therefore Q should be used before detection p-value calculation (P) and background subtraction (B) when necessary. Channel inference (C) and dye bias correction (D) should take place early since dye bias effect is global. C should be placed before D because dye bias correction uses in-band signal the identification of which relies on correct channel designation. Detection p-value (P) should happen before background subtraction (B) since background subtraction modifies signal and may affect out-of-band signal assumption used in P. Lastly, functions that explicitly normalizes $\beta$ value distribution (M) should happen last if they even need to be used.

See Supplemental Vignette for details of preprocessing functions.

Lift over across platforms

To allow data integration across platforms, one can harmonize beta values, probe ID list and even signal intensities, to a different platform using the mLiftOver (methylation Lift Over) utility. The following examples illustrates these corresponding usages.

Project probe IDs

cg_msa = names(sesameData_getManifestGRanges("MSA"))
## only mappable probes, return mapping from MSA to HM450
head(mLiftOver(cg_msa, "HM450"))

cg_hm450 = names(sesameData_getManifestGRanges("HM450"))
cg_hm450 = grep("cg", cg_hm450, value=TRUE)
## only mappable probes, return mapping from HM450 to EPICv2
head(mLiftOver(cg_hm450, "EPICv2"))

Project beta values

betas = openSesame(sesameDataGet("EPICv2.8.SigDF")[[1]])
betas_epic = mLiftOver(betas, "EPIC", impute=FALSE)
length(betas_epic)     # EPIC platform dimension
sum(is.na(betas_epic)) # NA values are present

betas_epic = imputeBetas(betas_epic)
length(betas_epic)     # EPIC platform dimension
sum(is.na(betas_epic)) # expect 0 NA after imputation

## use empirical evidence in mLiftOver
mapping = sesameDataGet("liftOver.EPICv2ToEPIC")
betas_matrix = openSesame(sesameDataGet("EPICv2.8.SigDF")[1:2])
dim(mLiftOver(betas_matrix, "EPIC", mapping = mapping))
## compare to without using empirical evidence
dim(mLiftOver(betas_matrix, "EPIC"))

Project signal SigDFs

sdf = sesameDataGet("EPICv2.8.SigDF")[["GM12878_206909630042_R08C01"]]
dim(mLiftOver(sdf, "EPICv2")) # EPICv2 platform dimension
dim(mLiftOver(sdf, "EPIC"))   # EPIC platform dimension
dim(mLiftOver(sdf, "HM450"))  # HM450 platform dimension

Collapse to cg prefixes

The more recent Infinium arrays (the mouse, EPICv2, and MSA arrays) have suffixes to uniquely identify the probe design. If one prefers to collapse methylation readings to the cg prefix so they are comparable to the previous array generations, this can be done with getBetas(..., collapseToPfx = TRUE). The same argument can be passed to openSesame(..., func=getBetas, collapseToPfx=TRUE)

betas = getBetas(sdf_from_EPICv2, collapseToPfx = TRUE)
## or
betas = openSesame("path_to_idats", collapseToPfx = TRUE)
## by default the method for collapsing is to make means
betas = openSesame("path_to_idats", collapseToPfx = TRUE, collapseMethod = "mean")
## one can also switch to min detection p-value
betas = openSesame("path_to_idats", collapseToPfx = TRUE, collapseMethod = "minPval")

One may also use the betasCollapseToPfx function to generate the traditional cg number readings.

betas = betasCollapseToPfx(betas, BPPARAM=BiocParallel::MulticoreParam(2))

Please find more info at the Supplemental documentation at the following [link] (https://zhou-lab.github.io/sesame/dev/supplemental.html#Replicate_Probes)