library(wheatmap) library(dplyr) options(rmarkdown.html_vignette.check_title = FALSE)
SeSAMe provides extensive native support for the Illumina Mouse Methylation BeadChip array (referred to as the MM285 or MMB array) and the Horvath Mammal40 array (refered to as the Mammal40 array). The MM285 array contains ~285,000 probes covering over 20 design categories including gene promoters, enhancers, CpGs in synteny to human EPIC array as well as other biology. In SeSAMe, MM285 is used as the product abbreviation to distinguish future platforms including MM320. This documents describe the procedure to process the MM285 and the Mammal40 array.
We first load required library and perform sesame data caching (only needed at new SeSAMe installation).
library(sesame) sesameDataCache()
SeSAMe supports automatic inference of the profiled organism. This is achieved
by the inferSpecies
function. Usually, users need not call this function
explicitly and only need to specify the code S
as part of the 2nd argument of
the openSesame
function. See the Basics Usage vignette
for more detail.
The following example downloads an example Mammal40 array IDAT pair and call
openSesame
function with species inference (note the S
in the prep=
argument).
Download test IDATs
GSM4411982_Grn.idat.gz
and GSM4411982_Red.idat.gz
from
https://github.com/zhou-lab/InfiniumAnnotationV1/tree/main/Test
betas = openSesame(sprintf("~/Downloads/GSM4411982", tmp), prep="SHCDPM")
The above code is equivalent to
## equivalent to the above openSesame call betas = getBetas(matchDesign(pOOBAH(dyeBiasNL(inferInfiniumIChannel( prefixMaskButC(inferSpecies(readIDATpair( "~/Downloads/GSM4411982"))))))))
As can be seen, inferSpecies
takes a SigDF as input and outputs an updated
SigDF which contains a species-specific masking and color channel
designation. This information is key to proper preprocessing since knowledge of
the color channel designation and probe hybridization success is the
foundational assumption of many preprocessing algorithms. One may instruct the
function to return the inferred species information by using the
return.species = TRUE
argument. The following example shows this usage:
sdf = sesameDataGet("Mammal40.1.SigDF") # an example SigDF inferSpecies(sdf, return.species = TRUE)
Internally, we used a nonparametric scoring method to infer the most likely
species from a pool of 310 candidate species from Ensembl (v101). The
return.auc = TRUE
argument allows one to peek into the AUC (Area Under the
Curve) score generated in this inference. The higher the score, the more likely
the data is from the corresponding species. Knowing the scores can help
diagnose misclassifications such as when several candidate species are closely
related and hard to discriminate from data.
## showing the candidate species with top scores head(sort(inferSpecies(sdf, return.auc = TRUE), decreasing=TRUE))
If the user believes that automatic inference gives wrong (most often still
close-related) species, one can force species inference to a target species by
using the updateSigDF
function. For example, the following code forces the
SigDF
to be treated as a mus_musculus
sample. Note this doesn't alter
signal intensity but only change the probe masking and color channel spec (the
view of the data, not the data reading itself).
sdf_mouse <- updateSigDF(sdf, species="mus_musculus")
CRITICAL: Since updateSigDF
function resets the whole mask and col column
of SigDF. One should use this function (and inferSpecies
) before other
preprocessing functions that sets mask and col.
Like species inference, strain-specific masking and preprocessing is important
for mouse array samples. This is achieved by the inferStrain
function. The
function is represented by the T
code in openSesame
/prepSesame
. The
following example shows how to use inferStrain
in openSesame
. Note the use
of T
in the prep code.
Download test IDATs
204637490002_R05C01_Grn.idat
and 204637490002_R05C01_Red.idat
from
https://github.com/zhou-lab/InfiniumAnnotationV1/tree/main/Test
betas = openSesame("~/Downloads/204637490002_R05C01", prep="TQCDPB")
Like inferSpecies
, one need to call the inferStrain
function before calling
the standard noob
, dyeBiasNL
, etc (by having T
before QCDPB
when
calling openSesame
). Also like inferSpecies()
, inferStrain()
will return
a new SigDF
with col and mask updated to reflect the change of
strain. Optionally, one can also specify return.strain=TRUE
,
return.pval=TRUE
or return.probability=TRUE
to return the inferred strain,
the p-value or the probabilities of all 37 strain candidates. Internally, the
function converts the beta values to variant allele frequencies. It should be
noted that since variant allele frequency is not always measured by the
M-allele of the probe. SeSAMe flips the $\beta$ values for some probes to
calculate variant allele frequency. The following example shows what
inferStrain
does to a SigDF
:
sdf = sesameDataGet("MM285.1.SigDF") # an example dataset inferStrain(sdf, return.strain = TRUE) # return strain as a string sdf_after = inferStrain(sdf) # update mask and col by strain inference sum(sdf$mask) # before strain inference sum(sdf_after$mask) # after strain inference
Let's visualize the probabilities of all candidate strains using the
return.probabilities
option:
library(ggplot2) p = inferStrain(sdf, return.probability = TRUE) df = data.frame(strain=names(p), probs=p) ggplot(data = df, aes(x = strain, y = probs)) + geom_bar(stat = "identity", color="gray") + ggtitle("Strain Probabilities") + ylab("Probability") + xlab("") + scale_x_discrete(position = "top") + theme(axis.text.x = element_text(angle = 90, vjust=0.5, hjust=0), legend.position = "none")
See also The Supplemental Vignette for heatmap validation of strain inference.
In the MM285 array, we designed multimapping probes to allow querying
transposable elements and other biology. We also exposed probes with
potentially design flaws. These suboptimally designed probes take a new probe
ID prefix ("uk") in addition to the "cg"/"ch"/"rs" typically seen. By default
the repeat and suboptimally designed probes are masked by NA
. These masking
is done by the qualityMask
function (or Q
in prep codes). To override
masking these probes, one can use the resetMask
function (the 0
code in
openSesame
) to remove the default masking. We recommend using it after the
preprocessing function that depends on reliable/uniquely-mapped probes, but
before detection p-value based masking (e.g. pOOBAH). This way, probes that
fail detection can still be flagged (they should be).
sdf = sesameDataGet('MM285.1.SigDF') sum(is.na(openSesame(sdf, prep="TQCDPB"))) sum(is.na(openSesame(sdf, prep="TQCD0PB")))
UNDER CONSTRUCTION
There are other inferences one can do on the nonhuman arrays, e.g., sex, age (epigenetic clocks), tissue, copy number alteration etc. These will be elaborated in The Inference Vignette.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.