title: "restfulSE -- experiments with HDF5 server content wrapped in SummarizedExperiment" author: "Vincent J. Carey, stvjc at channing.harvard.edu, Shweta Gopaulakrishnan, reshg at channing.harvard.edu, includes contributions from S. Pollack" vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{restfulSE -- experiments with SE interface to remote HDF5} %\VignetteEncoding{UTF-8} output: BiocStyle::pdf_document: toc: yes number_sections: yes BiocStyle::html_document: highlight: pygments number_sections: yes theme: united toc: yes
```{r setup,echo=FALSE,results="hide"} suppressPackageStartupMessages({ library(restfulSE) library(GO.db) library(org.Hs.eg.db) library(SummarizedExperiment) library(ExperimentHub) library(AnnotationHub) })
# restfulSE
This R package includes proof-of-concept code illustrating several approaches to SummarizedExperiment design with
assays stored out-of-memory.
## HSDS-backed SummarizedExperiment
Several data structures are introduced
- to model the HDF Scalable Data Service (HSDS) server data architecture and
- to perform targeted extraction of numerical data from HDF5 arrays stored on the server.
We work with HDF Object store (https://www.hdfgroup.org/solutions/hdf-cloud/).
### Illustration with 10x genomics 1.3 million neurons
We used Martin Morgan's [TENxGenomics](https://github.com/mtmorgan/TENxGenomics) package to transform the sparse-formatted HDF5 supplied by 10x into a
dense HDF5 matrix to support natural slicing. Thanks to native compression
in HDF5, the data volume expansion is modest.
A helper function in the restfulSE package creates a `RESTfulSummarizedExperiment` instance that points to the full numerical dataset.
```{r do10x,eval=TRUE}
library(restfulSE)
my10x = se1.3M()
my10x
As an exercise, we acquire the ENSEMBL identifiers for mouse genes annotated to hippocampus development, which has GO ID GO:0021766, and check counts for 10 genes on 6 samples: ```{r doanno, eval=TRUE} library(org.Mm.eg.db) hippdev = select(org.Mm.eg.db, keys="GO:0021766", keytype="GO", column="ENSEMBL")$ENSEMBL hippdev = intersect(hippdev, rownames(my10x))
The result:
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0 0 0 0 0 0 [2,] 0 0 0 0 0 0 [3,] 0 0 0 1 0 0 [4,] 0 1 2 6 5 0 [5,] 0 0 0 0 0 0 [6,] 1 2 4 8 7 3 [7,] 0 0 0 0 0 0 [8,] 0 0 0 0 0 2 [9,] 0 0 0 0 0 0 [10,] 3 0 3 0 1 9
### Illustration with GTEx tissue expression
We exported the content of the [recount2 GTEx
gene-level quantifications](http://duffel.rail.bio/recount/SRP012682/rse_gene.Rdata) to our HDF5 server. A convenience function is
available:
```{r lktiss, eval=TRUE}
tiss = gtexTiss()
tiss
We'll use this remote data as a tool for investigating transcriptional patterns in brain anatomy. We can identify the samples from brain using the 'smtsd' colData element:
```{r findbr} binds = grep("Brain", tiss$smtsd) table(tiss$smtsd[binds][1:100]) # check diversity in 100 samples
We'll identify genes annotated to neurotrophic functions
using another convenience function in this package:
```{r findn}
ntgenes = goPatt(termPattern="neurotroph")
head(ntgenes)
Extensive human and computational effort is expended on downloading and managing large genomic data at site of analysis. Interoperable formats that are accessible via generic operations like those in RESTful APIs may help to improve cost-effectiveness of genome-scale analyses.
In this report we examine the use of HDF5 server as a back end for assay data, mediated through the RangedSummarizedExperiment API for interactive use.
A modest server configured to deliver HDF5 content via a RESTful API has been prepared and is used in this vignette.
We want to provide rapid access to array-like data. We'll work with the Banovich 450k data as there is a simple check against an in-memory representation.
```{r setup2,echo=FALSE} suppressPackageStartupMessages({ library(restfulSE) library(SummarizedExperiment) library(Rtsne) library(rhdf5client) })
We build a SummarizedExperiment by combining an assay-free
RangedSummarizedExperiment with this reference.
```{r doba3}
ehub = ExperimentHub::ExperimentHub()
tag = names(AnnotationHub::query(ehub, "banoSEMeta"))
banoSE = ehub[[tag[1]]]
ds = HSDSArray(endpoint=URL_hsds(),svrtype="hsds",
domain="/shared/bioconductor/bano_meQTLex.h5",dsetname="/assay001")
assays(banoSE, withDimnames=FALSE) = SimpleList(betas=ds)
banoSE
We can update the SummarizedExperiment metadata
through subsetting operations, and then extract the relevant
assay data. The data are retrieved from the remote server
with the assay
method.
```{r doba4}
rbanoSub = banoSE[5:8, c(3:9, 40:50)]
assay(rbanoSub)
## 10xGenomics examples
### t-SNE for a set of genes annotated to hippocampus
We have used Martin Morgan's TENxGenomics package
to create a dense HDF5 representation of the
assay data, and placed it in HSDS.
We will subset genes annotated to hippocampus development.
Here are some related categories:
12092 GO:0021766 hippocampus development 12096 GO:0021770 parahippocampal gyrus development 34609 GO:0097410 hippocampal interneuron differentiation 34631 GO:0097432 hippocampal pyramidal neuron differentiation 34656 GO:0097457 hippocampal mossy fiber 35169 GO:0098686 hippocampal mossy fiber to CA3 synapse 42398 GO:1990026 hippocampal mossy fiber expansion
Oct 2022: Following code needs revision.
```{r anno,eval=FALSE}
library(org.Mm.eg.db)
atab = select(org.Mm.eg.db, keys="GO:0021766", keytype="GO", columns="ENSEMBL")
hg = atab[,"ENSEMBL"]
ehub = ExperimentHub::ExperimentHub()
lkst100k = AnnotationHub::query(ehub, "st100k")
tenx100k = ehub[[names(lkst100k)]]
length(hgok <- intersect(hg, rownames(tenx100k)))
This is a very scattered collection of rows in the matrix. We acquire expression measures for genes annotated to hippocampus on 4000 samples. t-SNE is then used to project the log-transformed measures to the plane.
```{r getdat, cache=FALSE,eval=FALSE} hipn = assay(tenx100k[hgok,1:4000]) # slow d = dist(t(log(1+hipn)), method="manhattan") proj = Rtsne(d)
```{r plt,fig=FALSE,eval=FALSE}
plot(proj$Y)
Tasic et al. (Nature neuro 2016, DOI 10.1038/nn.4216) describe single cell analysis of the adult murine brain, identify clusters of cells with distinct transcriptional profiles and anatomic location, and enumerate lists of genes that discriminate these clusters. The tasicST6 DataFrame provides details.
```{r lktas}
ehub = ExperimentHub::ExperimentHub() tag = names(AnnotationHub::query(ehub, "tasicST6")) tasicST6 = ehub[[tag[1]]] tasicST6
Key high-level discrimination concerns cells regarded as
GABAergic vs. glutamatergic (inhibitory vs excitatory
neurotransmission).
## Background
Banovich et al. published a subset of DNA methylation measures
assembled on 64 samples of immortalized B-cells from the YRI HapMap cohort.
```{r lkd}
library(restfulSE)
ehub = ExperimentHub::ExperimentHub()
tag = names(AnnotationHub::query(ehub, "banoSEMeta"))
banoSEMeta = ehub[[tag[1]]]
banoSEMeta
The numerical data have been exported using H. Pages' saveHDF5SummarizedExperiment applied to the banovichSE SummarizedExperiment in the yriMulti package. The HDF5 component is imported to HSDS with the hsload utility disributed with the python package h5pyd.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.