require(knitr) opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
BiocStyle::markdown()
The r Biocpkg("FlowSorted.Blood.WGBS.BLUEPRINT")
package
provides a R / Bioconductor resource for representing and
manipulating N=44 flow sorted purified cell types from whole blood
measured on a whole genome bisulfite-sequencing (WGBS) platform
generated by the BLUEPRINT consortium (http://www.blueprint-epigenome.eu).
This package makes extensive use of the r Biocpkg("HDF5Array")
package to avoid loading the entire data set in memory, instead
storing the methylated and coverage reads on disk as HDF5 files
and loading subsets of the data into memory upon request.
We use the FlowSorted.Blood.WGBS.BLUEPRINT()
function to
download the relevant files from Bioconductor's ExperimentHub
web resource. This includes the HDF5 files containing the
number of methylated reads (M
) and total reads (Cov
), as
well as the genomic locations of the CpGs and metadata on the
columns (samples). The output is a single BSseq
object
from the r Biocpkg("bsseq")
package. This is equivalent
to a SummarizedExperiment
class but with a number of
features specific to DNA methylation data.
library(FlowSorted.Blood.WGBS.BLUEPRINT) bs <- FlowSorted.Blood.WGBS.BLUEPRINT() bs
The colData
containing the phenotypic information
about each sample can be extracted using
colData(bs)
The coverage matrix (total number of reads or Cov
)
and methylation matrix (number of methylated reads or
M
) is represented as a DelayedMatrix
from the
r Biocpkg("DelayedArray")
package. This wraps the
underlying HDF5 file in a container that can be manipulated
in R.
hdf5_cov <- getCoverage(bs, type = "Cov") hdf5_meth <- getCoverage(bs, type = "M") hdf5_cov hdf5_meth
To quickly explore the data set, we compute some summary
statistics on the coverage matrix. We tell the
r Biocpkg("DelayedArray")
block size to indicate that we
can use up to 1 GB of memory for loading the data into
memory from disk.
options(DelayedArray.block.size=1e9)
We are interested in removing all CpGs that have no coverage in at least one sample, a naive implementation might be
keep_ids <- (DelayedArray::rowSums(hdf5_cov == 0) == 0) table(keep_ids)
We see that results in about r sum(keep_ids) / 1e6
million
CpGs. Now, we can reduce the size of the object to
bs.filtered <- bs[keep_ids,]
bs.filtered
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.