suppressMessages(library(knitr)) suppressMessages(library(GenomicRanges)) suppressMessages(library(limma)) suppressMessages(library(minfi)) suppressMessages(library(ExperimentHub)) suppressMessages(library(recountmethylation)) knitr::opts_chunk$set(echo = TRUE, eval = TRUE, warning = FALSE, message = FALSE)
recountmethylation
is an R/Bioconductor package providing resources to access
and analyze compilations of public DNA methylation (DNAm) array data from the
Gene Expression Omnibus (GEO). The database compilation files span two array
platforms and include mined, mapped, and model-based sample metadata. The DNAm
signals can be accessed in a variety of formats and data storage types. This
User's Guide shows how to use the recountmethylation
package, including
crucial background about the platforms and datatypes, and runnable examples using
2 small example files. Additional info and more advanced analysis examples are
contained in other package vignettes.
The recountmethylation
resource now includes three compilation versions,
detailed in the table below. The initial versions only included samples run
using the HM450K platform, while newer versions also included samples run using
the EPIC platform. These compilations currently include 93,306 samples run on the HM450K
platform, 38,122 samples run on the EPIC platform, and 131,428 total samples.
dft <- data.frame(release = c("first", "second", "third", "total"), version.label = c("0.0.1", "0.0.2", "0.0.3", "all"), date = c("11/20/2020", "01/06/2021", "12/21/2022", "12/21/2022"), hm450k.samples = c(35360, 50400, 7546, sum(c(35360, 50400, 7546))), epic.samples = c(0, 12650, 25472, sum(c(0, 12650, 25472)))) dft$combined.samples <- dft$hm450k.samples + dft$epic.samples knitr::kable(dft, align = "c")
Database compilation file download and access is managed by the get_db
functions, where the DNAm array platform type using the platform
argument
(see ?get_db
for details). Both HM450K and EPIC/HM850K platforms are
currently supported (see below for platform details). Note you will need
between 50-180 Gb of disk space to store a single database file. Files pair
sample metadata and assay data in various formats, including HDF5-SummarizedExperiment
database directories, and HDF5
database files with the .h5
extension.
The databases are located at https://methylation.recount.bio/, and file details are viewable as follows:
sm <- as.data.frame(smfilt(get_servermatrix())) if(is(sm, "data.frame")){knitr::kable(sm, align = "c")}
The DNAm array database files are indexed on ExperimentHub
, and are
viewable as follows. Note, the cache needs to be set with R_user_dir()
per instructions here.
cache.path <- tools::R_user_dir("recountmethylation") setExperimentHubOption("CACHE", cache.path) hub <- ExperimentHub::ExperimentHub() # connect to the hubs rmdat <- AnnotationHub::query(hub, "recountmethylation") # query the hubs
In addition to using the getdb
functions, the HDF5
(".h5"" extension)
files may be downloaded from the hubs.
fpath <- rmdat[["EH3778"]] # download with default caching rhdf5::h5ls(fpath) # load the h5 file
Note that whether downloads use the hubs or getdb
functions, caching
is implemented to check for previously downloaded database files.
Please note the following disclaimer, which also shows when recountmethylation
is loaded:
Databases accessed with `recountmethylation` contain data from GEO (ncbi.nlm.nih.gov/geo/), a live public database where alterations to online records can cause discrepancies with stored data over time. We cannot guarantee the accuracy of stored data, and advise users cross-check their findings with latest available records.
This section includes essential background about DNAm array platforms, assays and file types, and sample metadata.
Databases include human samples run on the Illumina Infinium HM450K BeadArray platform. HM450K is a popular 2-channel platform that probes over 480,000 CpG loci genome-wide, with enriched coverage at CG islands, genes, and enhancers @sandoval_validation_2011. The more recently released EPIC/HM850K platform contains an expanded probe set targeting over 850,000 CpGs, including more than 90% of the HM450K probes, with greater coverage of potential intergenic regulatory regions @pidsley_critical_2016.
Array processing generates 2 intensity files (IDATs) per sample, one each for
the red and green color channels. These raw files also contain control signals
useful for quality evaluations @noauthor_illumina_2010. The BeadArray probes use
either of 2 bead technologies, known as Type I and Type II, where the majority
(72%) of probes use the latter. For Type II probes, a single bead assay informs
a single probe, while Type I probes use 2 beads each. Practically, this means
the bead-specific matrices found in RGChannelSet
objects are larger than the
probe-specific matrices found in derived object types (e.g. for HM450K samples,
622,399 assays for red/green signal matrices versus 485,512 assays for
methylated/unmethylated signal, DNAm fractions matrices, see below).
SummarizedExperiment
object classesDNAm array sample IDATs can be read into an R session as an object of class
RGChannelSet
, a type of SummarizedExperiment
. These objects support
analyses of high-throughput genomics datasets, and they include slots for
assay matrices, sample metadata, and experiment metadata. During a typical
workflow, normalization and preprocessing convert RGChannelSet
objects into
new types like MethylSet
and RatioSet
. While not all IDAT information is
accessible from every object type (e.g. only RGChannelSet
s can contain
control assays), derived objects like MethylSet
s and RatioSet
s may be
smaller and/or faster to access.
Three SummarizedExperiment
databases are provided as
HDF5-SummarizedExperiment
files, including an unnormalized RGChannelSet
(red/green signals), an unnormalized MethylSet
(methylated/unmethylated
signals) and a normalized GenomicRatioSet
(DNAm fractions). For the latter,
DNAm fractions (logit2 Beta-values, or M-values) were normalized using the
out-of-band signal or "noob" method, an effective within-sample normalization
that removes signal artifacts @triche_low-level_2013.
Database files are stored as either HDF5
or HDF5-SummarizedExperiment
. For
most R users, the latter files will be most convenient to work with. HDF5
, or
hierarchical data format 5, combines compression and chunking for convenient
handling of large datasets. HDF5-SummarizedExperiment
files combine the
benefits of HDF5
and SummarizedExperiment
entities using a
DelayedArray-powered backend. Once an HDF5-SummarizedExperiment
file is
loaded, it can be treated similarly to a SummarizedExperiment
object in
active memory. That is, summary and subset operations execute rapidly, and
realization of large data chunks in active memory is delayed until called for
by the script (see examples).
Sample metadata are included with DNAm assays in the database files. Currently,
metadata variables include GEO record IDs for samples (GSM) and studies (GSE),
sample record titles, learned labels for tissue and disease, sample type
predictions from the MetaSRA-pipeline, and DNAm model-based predictions for
age, sex, and blood cell types. Access sample metadata from
SummarizedExperiment
objects using the pData
minfi function (see examples).
Examples in the data_analyses
vignette illustrate some ways to utilize the
provided sample metadata.
Provided metadata derives from the GSE-specific SOFT files, which contain
experiment, sample, and platform metadata. Considerable efforts were made to
learn, harmonize, and predict metadata labels. Certain types of info lacking
in the recountmethylation
metadata may be available in the SOFT files,
especially if it is sample non-specific (e.g. methods text, PubMed ID, etc.)
or redundant with DNAm-derived metrics (e.g. DNAm summaries, predicted sex,
etc.).
It is good practice to validate the harmonized metadata with original metadata records, especially where labels are ambiguous or there is insufficient information for a given query. GEO GSM and GSE records can be viewed from a browser, or SOFT files may be downloaded directly. Packages like GEOmetadb and GEOquery are also useful to query and summarize GEO metadata.
HDF5-SummarizedExperiment
exampleThis example shows basic handling for HDF5-SummarizedExperiment
(a.k.a.
"h5se") files. For these files, the getdb
function returns the loaded file.
Thanks to a DelayedArray
backend, even full-sized h5se
databases can be
treated as if they were fully loaded into active memory.
The test h5se
dataset includes sample metadata and noob-normalized
DNAm fractions (Beta-values) for chromosome 22 probes for 2 samples.
Datasets can be downloaded using the getdb
series of functions
(see ?getdb
for details), where the dfp
argument specifies the
download destination. The test h5se
file is included in the package
"inst" directory, and can be loaded as follows.
dn <- "remethdb-h5se_gr-test_0-0-1_1590090412" path <- system.file("extdata", dn, package = "recountmethylation") h5se.test <- HDF5Array::loadHDF5SummarizedExperiment(path)
Common characterization functions can be used on the dataset after it has been
loaded. These include functions for SummarizedExperiment
-like objects, such
as the getBeta
, pData
, and getAnnotation
minfi functions. First, inspect
the dataset using standard functions like class
, dim
, and summary
as
follows.
class(h5se.test) # inspect object class
dim(h5se.test) # get object dimensions
summary(h5se.test) # summarize dataset components
Access the sample metadata for the 2 available samples using pData
.
h5se.md <- minfi::pData(h5se.test) # get sample metadata dim(h5se.md) # get metadata dimensions
colnames(h5se.md) # get metadata column names
Next get CpG probe-specific DNAm fractions, or "Beta-values", with getBeta
(rows are probes, columns are samples).
h5se.bm <- minfi::getBeta(h5se.test) # get dnam fractions dim(h5se.bm) # get dnam fraction dimensions
colnames(h5se.bm) <- h5se.test$gsm # assign sample ids to dnam fractions knitr::kable(head(h5se.bm), align = "c") # show table of dnam fractions
Access manifest information for probes with getAnnotation
. This includes the
bead addresses, probe type, and genome coordinates and regions. For full details
about the probe annotations, consult the minfi and Illumina platform documentation.
an <- minfi::getAnnotation(h5se.test) # get platform annotation dim(an) # get annotation dimensions
colnames(an) # get annotation column names
ant <- as.matrix(t(an[c(1:4), c(1:3, 5:6, 9, 19, 24, 26)])) # subset annotation knitr::kable(ant, align = "c") # show annotation table
HDF5
database and exampleTo provide more workflow options, bead-specific red and green signal data have
been provided with sample metadata in an HDF5
/h5
file. This example shows
how to handle objects of this type with recountmethylation
.
The test h5
file includes metadata and bead-specific signals from
chromosome 22 for the same 2 samples as in the h5se
test file.
Note getdb
functions for h5
files simply return the database path.
Since the test h5
file has also been included in the package "inst" folder,
get the path to load the file as follows.
dn <- "remethdb-h5_rg-test_0-0-1_1590090412.h5" # get the h5se directory name h5.test <- system.file("extdata", "h5test", dn, package = "recountmethylation") # get the h5se dir path
Use the file path to read data into an RGChannelSet
with the getrg
function. Setting all.gsm = TRUE
obtains data for all samples in the
database files, while passing a vector of GSM IDs to gsmv
argument
will query a subset of available samples. Signals from all available
probes are retrieved by default, and probe subsets can be obtained by
passing a vector of valid bead addresses to the cgv
argument.
h5.rg <- getrg(dbn = h5.test, all.gsm = TRUE) # get red/grn signals from an h5 db
To avoid exhausting active memory with the full-sized h5
dataset, provide
either gsmv
or cgv
to getrg
, and set either all.cg
or all.gsm
to
FALSE (see ?getrg
for details).
As in the previous example, use pData
and getAnnotation
to get sample
metadata and array manifest information, respectively. Access the green and
red signal matrices in the RGChannelSet
with the getRed
and getGreen
minfi functions.
h5.red <- minfi::getRed(h5.rg) # get red signal matrix h5.green <- minfi::getGreen(h5.rg) # get grn signal matrix dim(h5.red) # get dimensions of red signal matrix
knitr::kable(head(h5.red), align = "c") # show first rows of red signal matrix
knitr::kable(head(h5.green), align = "c") # show first rows of grn signal matrix
identical(rownames(h5.red), rownames(h5.green)) # check cpg probe names identical
Rows in these signal matrices map to bead addresses rather than probe IDs.
These matrices have more rows than the h5se
test Beta-value matrix because
any type I probes use data from 2 beads each.
This section demonstrates validation using the test databases. Full code to reproduce this section is provided but not evaluated, as it involves a download from the GEO servers. As the disclaimer notes, it is good practice to validate data against the latest available GEO files. This step may be most useful for newer samples published close to the end compilation date (through November 7, 2020 for current version), which may be more prone to revisions at initial publication.
Use the gds_idat2rg
function to download IDATs for the 2 test samples
and load these into a new RGChannelSet
object. Do this by passing a vector
of GSM IDs to gsmv
and the download destination to dfp
. (note, chunks in
this section are fully executable, but not evaluated for this vignette).
# download from GEO dlpath <- tempdir() # get a temp dir path gsmv <- c("GSM1038308", "GSM1038309") # set sample ids to identify geo.rg <- gds_idat2rg(gsmv, dfp = dlpath) # load sample idats into rgset colnames(geo.rg) <- gsub("\\_.*", "", colnames(geo.rg)) # assign sample ids to columns
Extract the red and green signal matrices from geo.rg
.
geo.red <- minfi::getRed(geo.rg) # get red signal matrix geo.green <- minfi::getGreen(geo.rg) # get grn signal matrix
Match indices and labels between the GEO and h5
test signal matrices.
int.addr <- intersect(rownames(geo.red), rownames(h5.red)) # get probe address ids geo.red <- geo.red[int.addr,] # subset geo rgset red signal geo.green <- geo.green[int.addr,] # subset gro rgset grn signal geo.red <- geo.red[order(match(rownames(geo.red), rownames(h5.red))),] geo.green <- geo.green[order(match(rownames(geo.green), rownames(h5.green))),] identical(rownames(geo.red), rownames(h5.red)) # check identical addresses, red identical(rownames(geo.green), rownames(h5.green)) # check identical addresses, grn class(h5.red) <- "integer"; class(h5.green) <- "integer" # set matrix data classes to integer
Finally, compare the signal matrix data.
identical(geo.red, h5.red) # compare matrix signals, red
identical(geo.green, h5.green) # compare matrix signals, grn
Before comparing the GEO-downloaded data to data from the h5se.test
database,
normalize the data using the same out-of-band or "noob" normalization technique
that was used to generate data in the h5se
database.
geo.gr <- minfi::preprocessNoob(geo.rg) # get normalized se data
Next, extract the Beta-values.
geo.bm <- as.matrix(minfi::getBeta(geo.gr)) # get normalized dnam fractions matrix
Now match row and column labels and indices.
h5se.bm <- as.matrix(h5se.bm) # set dnam fractions to matrix int.cg <- intersect(rownames(geo.bm), rownames(h5se.bm)) geo.bm <- geo.bm[int.cg,] # subset fractions on shared probe ids geo.bm <- geo.bm[order(match(rownames(geo.bm), rownames(h5se.bm))),]
Finally, compare the two datasets.
identical(summary(geo.bm), summary(h5se.bm)) # check identical summary values
identical(rownames(geo.bm), rownames(h5se.bm)) # check identical probe ids
This section describes how to address potential issues with accessing the
database files or working with the DelayedArray
based objects locally.
If repeated attempts to download the database compilation files fail, you may try the following:
First ensure your internet connection is stable and there is sufficient space at the download destination for the database file.
Second, try increasing your timeout duration beyond the default before
repeating the download attempt with getdb
. Check the current timeout
for an R session with getOptions('timeout')
, then manually increase
the timeout duration with options(timeout = new.time)
.
Finally, you may attempt to download a server file using command line
calls to your system terminal or console. For instance, on a Mac you
might try wget -r <file_url>
. If this doesn't work, you can again
attempt to increase the timeout duration and repeat the download attempt.
DelayedArray
inputsUnexpected function behaviors may arise when using DelayedArray
-based inputs.
These essentially arise from lacking interoperativity between normal matrices
and the DelayedArray
-based matrices. Known examples include:
minfi::detectionP()
: Throws error for specific subsets of data, such as for queries of exactly 50 samples.
detectionP(rg[,1:50]) # get detection pvalues from rgset "Error in .local(Red, Green, locusNames, controlIdx, TypeI.Red, TypeI.Green, dim(Red_grid) == dim(detP_sink_grid) are not all TRUE"
minfi::preprocessFunnorm()
: Throws error when called for an RGChannelSet
of type HDF5-SummarizedExperiment
.
preprocessFunnorm(rg) # get noob-normalized data "Error: 'preprocessFunnorm()' only supports matrix-backed minfi objects.""
These and other related errors may be addressed by instantiating the data query,
or the data chunk, as a new non-DelayedArray
object. For example, remake a
subset of the full h5se
dataset, rg
, as follows.
rg.h5se <- loadHDF5SummarizedExperiment(rg.path) # full h5se RGChannelSet rg.sub <- rg.h5se[,c(1:20)] # subset samples of interest rg.new <- RGChannelSet(Red = getRed(rg.sub), Green = getGreen(rg.sub), annotation = annotation(rg.sub)) # re-make as non-DA object gr <- preprocessFunnorm(rg.new) # repeat preprocessing
Alternatively, non-DelayedArray
RGChannelSet
objects can be readily generated from
the full h5
RGChannelSet
database with the provided function getrg()
.
Consult the Data Analyses vignette and main manuscript for analysis examples and details about data compilations.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.