The R package yolo is designed to subset large data by column and row attributes
using the familiar RangedSummarizedExperiment (rse) and SummarizedExperiment (se)
structures, without holding the matrix of values in memory. To achieve this, an
rseHandle S4 object is defined that inherits from the RangedSummarizedExperiment
class and adds two slots that map the current object's row and column indices to
the original indices in the backing file(s). Analogously, the seHandle class
inherits from SummarizedExperiment and is used when the row annotation is not a
GRanges object. Jointly, we refer to rseHandle and seHandle objects as
yoloHandle objects.
The getvalues command can then evaluate a yoloHandle object and pull the data from
the hard disk into memory. While adding and subsetting a yoloHandle object is
endomorphic (i.e. it returns the same class of rseHandle or seHandle supplied by
the user), the output of getvalues is a RangedSummarizedExperiment or
SummarizedExperiment object, depending on which type of handle is evaluated.
library(GenomicRanges)
library(SummarizedExperiment)
library(yolo)
library(RSQLite)
library(rhdf5)
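At a high level, the typical workflow is to build a handle, manipulate it with the
usual subsetting idioms, and only then pull values into memory with getvalues. The
sketch below is schematic only (the object names are placeholders, not package
data); the concrete, runnable steps follow throughout the rest of this vignette.

# Schematic workflow (placeholder names; see the worked examples below):
# handle  <- yoloHandleMake(rowData, colData, lookupFileName = "values.sqlite")
# handle2 <- handle[keep_rows, keep_cols]  # still an rseHandle/seHandle; no values read
# rse     <- getvalues(handle2)            # (Ranged)SummarizedExperiment with values in memory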
In the current implementation of yolo, we support storing data in HDF5 and SQLite
file formats. Tables in these files may be either sparse (three columns) or in a
normal matrix representation. Though not directly part of this package, we show
examples of how to export R data objects and files to the HDF5 and SQLite file
formats using the rhdf5 and RSQLite packages.
Notes:
1) The combination of "sparse" and "hdf5" is not supported.
2) By convention, parameter values throughout these functions should be lowercase.
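To make the two representations concrete, the short example below builds a toy
sparse (row, column, value) table and reshapes it into the equivalent normal
matrix. The toy data here are illustrative only and are not part of the package;
reshape2::acast is used the same way later in this vignette.

# Toy sparse representation: one (row, column, value) triple per non-zero entry
sparse_df <- data.frame(row    = c(1, 1, 2, 3),
                        column = c(1, 3, 2, 3),
                        value  = c(5, 2, 7, 1))
sparse_df

# Equivalent normal (dense) matrix, with missing entries filled with 0
dense_m <- reshape2::acast(sparse_df, row ~ column, fill = 0)
dense_m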
Below is one simple implementation of converting a .csv file that is in a sparse
matrix format into a .sqlite file.
f1name <- "d1.sqlite" db <- dbConnect(SQLite(), dbname=f1name) ft <- list(row="INTEGER", column="INTEGER", value="INTEGER") df1 <- read.table(system.file("extdata", "dat1.csv", package = "yolo"), sep = "," , header = TRUE) head(df1) dbWriteTable(conn=db, name="data", value=df1, field.types=ft) dbDisconnect(db)
The commands above create the "d1.sqlite" file, which can be linked to the
appropriate column and row data to create an rseHandle object. First, we import
these data:
readt <- read.table(system.file("extdata", "dat1_row.bed", package = "yolo"))
rowData1 <- GRanges(setNames(readt, c("chr", "start", "stop")))
colData1 <- read.table(system.file("extdata", "dat1_col.txt", package = "yolo"))
Next, we can build our rseHandle object using the constructor function below.
d1 <- yoloHandleMake(rowData1, colData1, lookupFileName = f1name)
d1
The yoloHandleMake function requires a GRanges object for the rowData when creating
an rseHandle, and a DataFrame object when constructing an seHandle. In both cases,
the constructor also takes an object that can be coerced into a DataFrame for the
colData and the name of a valid file that contains the matrix values on the
backend. When the constructor function is called, additional checks determine the
validity of the construction and ensure that the specified objects work together;
in particular, the constructor verifies that the dimensions of the rowData and
colData match the dimensions in the backend file.
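For example, supplying a DataFrame (rather than a GRanges) as the row annotation
should yield an seHandle instead of an rseHandle. The sketch below illustrates
this, reusing the SQLite file created above; the rowDF object is a hypothetical
annotation built here only for illustration.

# Hypothetical non-genomic row annotation: a DataFrame instead of a GRanges
rowDF <- DataFrame(feature_id = paste0("feature", seq_along(rowData1)))

# With DataFrame rowData, the constructor should return an seHandle rather than an rseHandle
d1_se <- yoloHandleMake(rowDF, colData1, lookupFileName = f1name)
class(d1_se)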
Three other parameters can be specified in the yoloHandleMake function. Namely,
lookupTableName specifies the index/table of the backend values; by default, the
constructor assumes "data", as we specified in the dbWriteTable command earlier in
the vignette. Another important parameter is lookupFileType, which can be specified
as either "sparse" (the default) or "normal". For a sparse matrix, we assume two
columns labeled "row" and "column" in addition to a third column that holds the
values (see the ft variable in the code above). For a "normal" matrix, the lookup
simply indexes by row and column position, so column names are not relevant for
that operation. Finally, lookupFileFormat can be either "HDF5" or "sqlite". The
call to the yoloHandleMake function above used all of the default values:
(lookupTableName = "data", lookupFileType = "sparse", lookupFileFormat = "sqlite")
Another implementation uses HDF5. Currently, yolo only supports the "normal" matrix
representation for HDF5 (sparse matrices are not supported), because the author
could not find a way to filter HDF5 rows based on their values. This package
supports putting multiple tables in either an HDF5 or SQLite file, and the
implementation would look similar to the following.
f2name <- "dat.hdf5" h5createFile(f2name) # Read and Reshape 3 data objects to a normal matrix df1 <- read.table(system.file("extdata", "dat1.csv", package = "yolo"), sep = "," , header = TRUE) dat1m <- reshape2::acast(df1, row ~ column, fill = 0) df2 <- read.table(system.file("extdata", "dat2.csv", package = "yolo"), sep = "," , header = TRUE) dat2m <- reshape2::acast(df2, row ~ column, fill = 0) df3 <- read.table(system.file("extdata", "dat3.csv", package = "yolo"), sep = "," , header = TRUE) dat3m <- reshape2::acast(df3, row ~ column, fill = 0) # Write to file h5write(dat1m, "dat.hdf5","dat1") h5write(dat2m, "dat.hdf5","dat2") h5write(dat3m, "dat.hdf5","dat3") h5ls("dat.hdf5")
To create an rseHandle for the first dataset:
d1h <- yoloHandleMake(rowData1, colData1, lookupFileName = f2name,
                      lookupTableName = "dat1", lookupFileFormat = "HDF5",
                      lookupFileType = "normal")
d1h
We'll also create an rseHandle object for the third dataset, referencing the same
HDF5 file but different colData (dat1 and dat3 were designed to have the same
rowData).
colData3 <- read.table(system.file("extdata", "dat3_col.txt", package = "yolo"))
d3h <- yoloHandleMake(rowData1, colData3, lookupFileName = f2name,
                      lookupTableName = "dat3", lookupFileFormat = "HDF5",
                      lookupFileType = "normal")
For normal matrices, we recommend the HDF5 format. However, if a user prefers
SQLite, this is supported. Our package assumes 1) the existence of a "row_names"
attribute in the table (automatically generated when row.names = TRUE, as shown
below) and 2) that each column name corresponds to the sample names (i.e. the names
of the colData) in the collated object. Below is an example of this construction.
colnames(dat3m) <- rownames(colData3)
dat3m <- data.frame(dat3m)
db <- dbConnect(SQLite(), dbname = f1name)
dbWriteTable(conn = db, name = "data3", value = dat3m, row.names = TRUE)
dbListFields(db, "data3")
dbDisconnect(db)
d3s <- yoloHandleMake(rowData1, colData3, lookupFileName = f1name,
                      lookupTableName = "data3", lookupFileFormat = "sqlite",
                      lookupFileType = "normal")
Again, we recommend working with HDF5 files for normal matrices. For sparse matrices, SQLite is currently the only supported format.
Users can add multiple rseHandle objects together as long as two conditions are met:
1) The rowRanges/rowData are the same
2) The names in colData are the same
Even though d1 pulls from a sparse SQLite file and d3h pulls from a normal HDF5
file, these two handles can be joined together because these two criteria are met.
Notice that the resulting object has 35 samples.
d13 <- d1 + d3h
d13
This feature allows samples from different experiments to be joined together at a
high level, again without reading any of the value data into memory aside from the
column and row metadata.
Users can subset using the [ and subsetByOverlaps calls that they are accustomed to
from a standard RangedSummarizedExperiment.
dss1 <- d13[, d13@colData$group == "group4"]
chr1reg <- GRanges(seqnames = c("chr1"),
                   ranges = IRanges(start = c(3338300), end = c(3422000)))
dss2 <- subsetByOverlaps(dss1, chr1reg)
d_small <- dss2[c(2, 3, 6, 7, 10), c(2, 6, 7, 3)]
d_small
Through this process of adding and subsetting, we've jumbled up our samples. Not to
worry! Using the getvalues function, the representation of our matrix data will be
preserved, because the handle keeps track of the indices of our files, rows, and
columns.
rse_small <- getvalues(d_small)
class(rse_small)
assay(rse_small, 1)
Up until this getvalues command, none of the values of the matrix were held in
memory. Thus, we could add and remove samples as well as filter row regions based
on GRanges/DataFrame or index logic, while maintaining the correct annotations
corresponding to our data.
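As a quick sanity check (assuming the standard SummarizedExperiment accessors), we
can confirm that the in-memory object carries the same row and column annotations
as the handle it was built from:

# The materialized object should keep the subsetted annotations
dim(rse_small)
rowRanges(rse_small)
colData(rse_small)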
Now that we have no further use for the files on disk, we can tidy up and remove them.
file.remove(f1name)
file.remove(f2name)
sessionInfo()