suppressPackageStartupMessages({ suppressMessages({ library(hsdsRclient) }) })
We are learning how to design R code to work with HSDS, an object store interface to HDF5. The hsdsRclient package will likely merge with rhdf5client soon (Q4 2017).
Please note the version of rhdf5client used here. There have been issues with the handling of binary transfer from HSDS that are not fully resolved.
sessionInfo()
There is a long-running server on XSEDE Jetstream that can be contacted from outside. We demonstrate sample-level sums for the 10x 1.3 million neuron dataset.
hostPath = "/home/john/tenx_full.h5"
serverPort = "http://149.165.156.174:5101/"
txsrc = HSDS_source(serverPort, hostPath)  # top level object
tx2 = HSDS_dataset(txsrc)  # get a sliceable reference
apply(tx2[1:6, 1:27998],1,sum)  # compute sample-level sums for first six
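To see what the row-wise apply call computes without contacting the server, the sketch below uses a small in-memory matrix as a stand-in for the remote slice. The matrix m and its counts are invented for illustration; only the orientation (one row per sample) is taken from the example above. For an ordinary numeric matrix, base R's rowSums gives the same values.

```r
# Stand-in for a slice such as tx2[1:3, 1:4]: 3 samples (rows) by 4 genes (columns).
# These counts are invented for illustration only.
m <- matrix(c(1, 0, 2, 0,
              0, 3, 0, 1,
              5, 0, 0, 0), nrow = 3, byrow = TRUE)

# Row-wise sums, written the same way as in the HSDS example.
sums_apply <- apply(m, 1, sum)

# rowSums is the base R shortcut for the same computation on a local matrix.
sums_fast <- rowSums(m)

stopifnot(isTRUE(all.equal(sums_apply, sums_fast)))
sums_apply
```

Whether rowSums can be applied directly to a slice of tx2 depends on what the slice returns, which is not checked here.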
Let's benchmark this activity and then drill a little deeper into the communication details.
library(microbenchmark)
microbenchmark(apply(tx2[1:6, 1:27998],1,sum), times=5)
There is a slot called transfermode in tx2:
tx2@transfermode
Change to 'binary':
tx2@transfermode = "binary"
microbenchmark(txs6 <- apply(tx2[1:6, 1:27998],1,sum), times=5)
txs6
A modest improvement.
Let's do 500 samples with a simple call to the binary interface.
microbenchmark(s500 <- apply(tx2[1:500, 1:27998],1,sum), times=2)
Some answers:
s500[1:6]
summary(s500)
It seems difficult to use multicore computing with the communications required here. So we use socket-based job distribution on a multicore machine.
library(BiocParallel)
spar = SnowParam(2, type="SOCK")
register(spar)
With SnowParam, a complete environment needs to be prepared on each worker. We encapsulate the retrieval task in the function encaps.
encaps = function (specl = list(inds = 1:2, port = "5101")) {
    stopifnot(all.equal(names(specl), c("inds", "port")))
    library(hsdsRclient)
    hostPath = "/home/john/tenx_full.h5"
    serverPort = sprintf("http://149.165.156.174:%s/", specl$port)
    txsrc = HSDS_source(serverPort, hostPath)
    tx2 = HSDS_dataset(txsrc)
    apply(tx2[specl$inds, 1:27998], 1, sum)
}
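The need for a self-contained function can be illustrated without the HSDS server. The sketch below swaps in a socket cluster from base R's parallel package in place of SnowParam; the names scale_factor, needs_global and self_contained are hypothetical. When run at the top level of a session, the first function fails on the workers because they never see the master's global variable, while the second succeeds because everything it needs arrives through its argument, as with encaps.

```r
library(parallel)  # ships with base R

cl <- makeCluster(2, type = "PSOCK")  # two socket-based workers

scale_factor <- 10  # defined only in the master session

# Relies on a global variable that is not sent to the workers.
needs_global <- function(i) i * scale_factor

# Self-contained: all inputs arrive through the argument.
self_contained <- function(spec) spec$i * spec$scale_factor

# At top level this call is expected to fail on the workers.
attempt <- try(parLapply(cl, 1:2, needs_global), silent = TRUE)
failed <- inherits(attempt, "try-error")

# This call needs nothing from the master's environment.
ok <- parLapply(cl,
                list(list(i = 1, scale_factor = 10),
                     list(i = 2, scale_factor = 10)),
                self_contained)

stopCluster(cl)
unlist(ok)
```

The same reasoning explains why encaps calls library(hsdsRclient) and rebuilds the dataset reference inside its own body.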
We will now query port 5101 with two parallel requests.
system.time(sans <- bplapply(list(
    list(inds=1:250, port="5101"),
    list(inds=251:500, port="5101")), encaps))
summary(unlist(sans))
Now use two different ports.
system.time(sans <- bplapply(list(
    list(inds=1:250, port="5101"),
    list(inds=251:500, port="5102")), encaps))
summary(unlist(sans))
Now use four different ports.
system.time(sans <- bplapply(list(
    list(inds=1:125, port="5101"),
    list(inds=126:250, port="5102"),
    list(inds=251:375, port="5103"),
    list(inds=376:500, port="5104")), encaps))
summary(unlist(sans))
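The index ranges and ports in these calls are typed out by hand. A minimal sketch of building the same list programmatically follows. The helper make_specs is hypothetical and not part of hsdsRclient; it assumes only the list(inds=, port=) layout that encaps checks for.

```r
# Hypothetical helper: split `inds` into contiguous chunks, one chunk per port,
# and return a list of list(inds = ..., port = ...) specifications.
make_specs <- function(inds, ports) {
  n_ports <- length(ports)
  # Assign each position to a group 1..n_ports, keeping groups contiguous.
  grp <- ceiling(seq_along(inds) * n_ports / length(inds))
  chunks <- unname(split(inds, grp))
  Map(function(ch, p) list(inds = ch, port = p), chunks, ports)
}

specs <- make_specs(1:500, c("5101", "5102", "5103", "5104"))

# Inspect the range and port of each specification.
lapply(specs, function(s) c(range(s$inds), port = s$port))
```

With the server available, the result could be passed along as bplapply(specs, encaps).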
Now use four different ports, each serving 500 samples (the full size of the initial single-port request).
system.time(sans <- bplapply(list(
    list(inds=1:500, port="5101"),
    list(inds=501:1000, port="5102"),
    list(inds=1001:1500, port="5103"),
    list(inds=1501:2000, port="5104")), encaps))
summary(unlist(sans))
Do it sequentially.
system.time(sans <- lapply(list(
    list(inds=1:500, port="5101"),
    list(inds=501:1000, port="5102"),
    list(inds=1001:1500, port="5103"),
    list(inds=1501:2000, port="5104")), encaps))
summary(unlist(sans))