suppressPackageStartupMessages({ suppressMessages({ library(restfulSE) library(rhdf5client) library(lihc450k) library(curatedTCGAData) library(EnsDb.Hsapiens.v75) # GRCh37 library(FDb.InfiniumMethylation.hg19) library(GenomeInfoDb) library(MethylMix) }) })
The r Biocpkg("curatedTCGAData")
package delivers instances of the MultiAssayExperiment
class, as defined in the r Biocpkg("MultiAssayExperiment")
package. When the
set of assays requested includes the DNA methylation data as assayed by the
Illumina 450K array, the standard serialization can be difficult to work with.
For the TCGA LIHC data, the serialized, compressed RDA approaches 1GB on disk.
It is natural to consider out-of-memory approaches to working with such data, but
the practical implications of this are not completely clear. The lihc450k
package
is experimental, with development at github.com/vjcitn/lihc450k
, making reference
to serializations of metadata and assay data that are lodged in public AWS S3 buckets,
and in a public instance of the HDF Object Store (known as HSDS) in a server
provided by John Readey of the HDF Group.
In this document we will illustrate how lihc450k
can be used to explore the
implications of hybrid approaches to representation of genome-scale assays. We use
r Biocpkg("BiocFileCache")
to manage persistent images of SummarizedExperiment
metadata and "local" HDF5. Ultimately we will illustrate how to use a hybrid
MultiAssayExperiment to carry out tasks of identifying transcriptionally predictive
DNA methylation states in the r Biocpkg("MethylMix")
package.
Nota bene: The first time you run lim = loadLIHC450k(useHSDS=FALSE)
,
a 1.2 GB HDF5 representation of the 450k data will be downloaded to
your BiocFileCache. This can take a while depending on network
througput. Once this task has completed successfully, this
function will run quickly. (To avoid the download, set useHSDS
to
TRUE
, in which case queries for 450k data will use the remote HSDS.)
We will use the loadLIHC450k
function to create a SummarizedExperiment
instance for the 450k data from the TCGA-LIHC (liver, hepatocellular carcinoma)
cohort. By specifying useHSDS=FALSE
, we indicate that we want to use
local HDF5 to manage the 450k data. This entails that BiocFileCache
will
be populated with the HDF5 representation of 396065 x 421
methylation
quantifications (beta values). Note that the quantifications were retrieved
using the download utility of MethylMix
and that missing values are present
in the data as delivered by Broad Firehose. In preparing the HDF5,
features for which no sample
provided a quantification were manually removed, and sporadically missing
values were replaced by values obtained by the impute.knn
function of the
r CRANpkg("impute")
package.
library(lihc450k) lim = loadLIHC450k(useHSDS=FALSE) # local
RNASeq2GeneNorm
expression dataWe want to work with matched expression and methylation data.
The MultiAssayExperiment
function uses a sampleMap
DataFrame
to link the samples.
lim_map = DataFrame(assay=factor(rep("lihc450k", ncol(lim))), primary=substr(colnames(lim),1,12), colname=colnames(lim))
We acquire the expression data. This will use the BiocFileCache if this is not the first time requesting these data, or will download to populate the cache if it is the first time requesting the data.
suppressMessages({ library(curatedTCGAData) lihcrna = curatedTCGAData("LIHC", "RNASeq2GeneNorm", dry.run=FALSE) })
We finish building the sampleMap
and construct the MultiAssayExperiment.
lrna = experiments(lihcrna)[[1]] sm = sampleMap(lihcrna) sm$assay = factor(rep("lihcrnaseq", nrow(sm))) sm2 = rbind(sm, lim_map) el = ExperimentList(lihcrnaseq=lrna, lihc450k=lim) lihcTM = MultiAssayExperiment(el, sampleMap=sm2, colData=colData(lihcrna)) lihcTM
The location of the 450k data can be seen as follows:
path(assay(lihcTM, "lihc450k"))
library(FDb.InfiniumMethylation.hg19) hm450.hg19 <- getPlatform(platform='HM450', genome='hg19') metn = rownames(lihcTM)[[2]] metloc = hm450.hg19[metn] metloc[1]
library(EnsDb.Hsapiens.v75) g37 = genes(EnsDb.Hsapiens.v75) g37[1]
We will use GRCh37 as the genome tag, and adopt the NCBI convention for chromosome enumeration.
genome(metloc) = "GRCh37" library(GenomeInfoDb) seqlevelsStyle(metloc) = "NCBI"
We'll use the default definition of promoters in conjunction with the gene addresses, and find CpGs in these regions.
pg37 = promoters(g37) cfilt = findOverlaps(metloc, pg37) memap = metloc[queryHits(cfilt)] memap$sym.prom = g37$symbol[subjectHits(cfilt)] memap[1:3]
As currently coded, MethylMix requires matrix inputs. We can derive the necessary inputs from the hybrid MultiAssayExperiment, using the annotation developed above. (With a small number of modifications that relax class checks in MethylMix, MethyMix can work directly from the out-of-memory representation, but we defer this for now.)
demoN = 500 # this is an overstatement of the number of CpG to be used # we will need to filter the CpG isolated into 'promoter' regions # above to match genes for which RNA-seq data are available nmet = experiments(lihcTM)[["lihc450k"]][seq_len(demoN),] # convenience subset rna = experiments(lihcTM)[["lihcrnaseq"]] # all RNA-seq ## delete memap = metloc[queryHits(cfilt)] memap$sym.prom = g37$symbol[subjectHits(cfilt)] ## memap = memap[intersect(names(memap), names(nmet))] # find the CpG-gene map nmet = nmet[names(memap)] # limit the CpG to those mapped rownames(nmet) = memap[rownames(nmet)]$sym.prom # rename the delayed rows oksym = intersect(rownames(nmet), rownames(rna)) # determine CpGs that correspond
Conversion of data to matrix form and separation of tumor and normal samples follows:
colnames(rna) = substr(colnames(rna),1,15) colnames(nmet) = substr(colnames(nmet),1,15) frna = assay(rna[oksym,]) etype = substr(colnames(frna),14,15) frna = frna[, which(etype=="01")] colnames(frna) = substr(colnames(frna),1,12) fmet = as(assay(nmet[oksym,]), "matrix") type = substr(colnames(nmet), 14, 15) tummet = fmet[, which(type=="01")] colnames(tummet) = substr(colnames(tummet),1,12) normmet = fmet[, which(type=="11")] colnames(normmet) = substr(colnames(normmet),1,12)
m1 = MethylMix(METcancer=tummet, GEcancer=frna, METnormal=normmet) pm = MethylMix_PlotModel("CCNJ", m1, tummet, frna, normmet) print(pm[[1]]) print(pm[[2]])
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.