knitr::opts_chunk$set(echo = TRUE, comment = "#>", collapse = TRUE, message = FALSE)
r Biocpkg("DelayedMatrixStats")
ports the r CRANpkg("matrixStats")
API
to work with DelayedMatrix objects from the r Biocpkg("DelayedArray")
package. It provides high-performing functions operating on rows and columns of
DelayedMatrix objects, including all subclasses such as RleArray (from the
r Biocpkg("DelayedArray")
package) and HDF5Array (from the
r Biocpkg("HDF5Array")
) as well as supporting all types of seeds, such as
matrix (from the base package) and Matrix (from the r CRANpkg("Matrix")
package).
The r Biocpkg("DelayedArray")
package allows developers to store array-like
data using in-memory or on-disk representations (e.g., in HDF5 files) and
provides a common and familiar array-like interface for interacting with these
data.
The r Biocpkg("DelayedMatrixStats")
package is designed to make life easier
for Bioconductor developers wanting to use r Biocpkg("DelayedArray")
by
providing a rich set of column-wise and row-wise summary functions.
We briefly demonstrate and explain these two features using a simple example. We'll simulate some (unrealistic) RNA-seq read counts data from 10,000 genes and 20 samples and store it on disk as a HDF5Array:
library(DelayedArray) x <- do.call(cbind, lapply(1:20, function(j) { rpois(n = 10000, lambda = sample(20:40, 10000, replace = TRUE)) })) colnames(x) <- paste0("S", 1:20) x <- realize(x, "HDF5Array") x
Suppose you wish to compute the standard deviation of the read counts for each gene.
You might think to use apply()
like in the following:
system.time(row_sds <- apply(x, 1, sd)) head(row_sds)
This works, but takes quite a while.
Or perhaps you already know that the r CRANpkg("matrixStats")
package
provides a rowSds()
function:
matrixStats::rowSds(x)
Unfortunately (and perhaps unsurprisingly) this doesn't work.
r CRANpkg("matrixStats")
is designed for use on in-memory matrix objects.
Well, why don't we just first realize our data in-memory and then use
r CRANpkg("matrixStats")
system.time(row_sds <- matrixStats::rowSds(as.matrix(x))) head(row_sds)
This works and is many times faster than the apply()
-based approach! However,
it rather defeats the purpose of using a HDF5Array for storing the
data since we have to bring all the data into memory at once to compute the
result.
Instead, we can use DelayedMatrixStats::rowSds()
, which has the speed
benefits of matrixStats::rowSds()
[^speed] but without having to load the
entire data into memory at once[^block_size]:
[^speed]: In fact, it currently uses matrixStats::rowSds()
under the hood.
[^block_size]: In this case, it loads blocks of data row-by-row. The amount of
data loaded into memory at any one time is controlled by the
default block size global setting; see ?DelayedArray::getAutoBlockSize
for details. Notably, if the data are small enough (and the default block size
is large enough) then all the data is loaded as a single block, but this
approach generalizes and still works when the data are too large to be
loaded into memory in one block.
library(DelayedMatrixStats) system.time(row_sds <- rowSds(x)) head(row_sds)
Finally, by using r Biocpkg("DelayedMatrixStats")
we can use the same code,
(colMedians(x)
) regardless of whether the input is an ordinary matrix or a
DelayedMatrix. This is useful for packages wishing to support both types of
objects, e.g., packages wanting to retain backward compatibility or during a
transition period from matrix-based to DelayeMatrix-based objects.
The initial release of r Biocpkg("DelayedMatrixStats")
supports the complete
column-wise and row-wise API r CRANpkg("matrixStats")
API[^api]. Please
see the r CRANpkg("matrixStats")
vignette
(available online)
for a summary these methods. The following table documents the API coverage and
availability of 'seed-aware' methods in the current
version of r Biocpkg("DelayedMatrixStats")
, where:
r Biocpkg("DelayedMatrixStats")
[^api]: Some of the API is covered via inheritance to functionality in r Biocpkg("DelayedArray")
matrixStats <- sort( c("colsum", "rowsum", grep("^(col|row)", getNamespaceExports("matrixStats"), value = TRUE))) sparseMatrixStats <- getNamespaceExports("sparseMatrixStats") DelayedMatrixStats <- getNamespaceExports("DelayedMatrixStats") DelayedArray <- getNamespaceExports("DelayedArray") api_df <- data.frame( Method = paste0("`", matrixStats, "()`"), `Block processing` = ifelse( matrixStats %in% DelayedMatrixStats, "✔", ifelse(matrixStats %in% c(DelayedArray, sparseMatrixStats), "☑️", "❌")), `_base::matrix_ optimized` = ifelse(sapply(matrixStats, existsMethod, signature = "matrix_OR_array_OR_table_OR_numeric"), "✔", "❌"), `_Matrix::dgCMatrix_ optimized` = ifelse(sapply(matrixStats, existsMethod, signature = "xgCMatrix") | sapply(matrixStats, existsMethod, signature = "dgCMatrix"), "✔", "❌"), `_Matrix::lgCMatrix_ optimized` = ifelse(sapply(matrixStats, existsMethod, signature = "xgCMatrix") | sapply(matrixStats, existsMethod, signature = "lgCMatrix"), "✔", "❌"), `_DelayedArray::RleArray_ (_SolidRleArraySeed_) optimized` = ifelse(sapply(matrixStats, existsMethod, signature = "SolidRleArraySeed"), "✔", "❌"), `_DelayedArray::RleArray_ (_ChunkedRleArraySeed_) optimized` = ifelse(sapply(matrixStats, existsMethod, signature = "ChunkedRleArraySeed"), "✔", "❌"), `_HDF5Array::HDF5Matrix_ optimized` = ifelse(sapply(matrixStats, existsMethod, signature = "HDF5ArraySeed"), "✔", "❌"), `_base::data.frame_ optimized` = ifelse(sapply(matrixStats, existsMethod, signature = "data.frame"), "✔", "❌"), `_S4Vectors::DataFrame_ optimized` = ifelse(sapply(matrixStats, existsMethod, signature = "DataFrame"), "✔", "❌"), check.names = FALSE) knitr::kable(api_df, row.names = FALSE)
As well as offering a familiar API, r Biocpkg("DelayedMatrixStats")
provides
'seed-aware' methods that are optimized for specific types of DelayedMatrix
objects.
To illustrate this idea, we will compare two ways of computing the column sums of a DelayedMatrix object:
r Biocpkg("DelayedArray")
package and is available for all methods in the r Biocpkg("DelayedMatrixStats")
through the force_block_processing
argumentr Biocpkg("DelayedMatrixStats")
and is optimized for both speed and memory but only for DelayedMatrix objects with certain types of seed.We will demonstrate this by computing the column sums matrices with 20,000 rows and 600 columns where the data have different structure and are stored in DelayedMatrix objects with different types of seed:
r CRANpkg("Matrix")
package, as the seedr Biocpkg("DelayedArray")
package as the seedWe use the r CRANpkg("microbenchmark")
package to measure running time and
the r CRANpkg("profmem")
package to measure the total memory allocations of
each method.
In each case, the 'seed-aware' method is many times faster and allocates substantially lower total memory.
library(DelayedMatrixStats) library(sparseMatrixStats) library(microbenchmark) library(profmem) set.seed(666) # ----------------------------------------------------------------------------- # Dense with values in (0, 1) # Fast, memory-efficient column sums of DelayedMatrix with ordinary matrix seed # # Generate some data dense_matrix <- matrix(runif(20000 * 600), nrow = 20000, ncol = 600) # Benchmark dm_matrix <- DelayedArray(dense_matrix) class(seed(dm_matrix)) dm_matrix microbenchmark( block_processing = colSums2(dm_matrix, force_block_processing = TRUE), seed_aware = colSums2(dm_matrix), times = 10) total(profmem(colSums2(dm_matrix, force_block_processing = TRUE))) total(profmem(colSums2(dm_matrix))) # ----------------------------------------------------------------------------- # Sparse (60% zero) with values in (0, 1) # Fast, memory-efficient column sums of DelayedMatrix with ordinary matrix seed # # Generate some data sparse_matrix <- dense_matrix zero_idx <- sample(length(sparse_matrix), 0.6 * length(sparse_matrix)) sparse_matrix[zero_idx] <- 0 # Benchmark dm_dgCMatrix <- DelayedArray(Matrix(sparse_matrix, sparse = TRUE)) class(seed(dm_dgCMatrix)) dm_dgCMatrix microbenchmark( block_processing = colSums2(dm_dgCMatrix, force_block_processing = TRUE), seed_aware = colSums2(dm_dgCMatrix), times = 10) total(profmem(colSums2(dm_dgCMatrix, force_block_processing = TRUE))) total(profmem(colSums2(dm_dgCMatrix))) # ----------------------------------------------------------------------------- # Dense with values in {0, 100} featuring runs of identical values # Fast, memory-efficient column sums of DelayedMatrix with Rle-based seed # # Generate some data runs <- rep(sample(100, 500000, replace = TRUE), rpois(500000, 100)) runs <- runs[seq_len(20000 * 600)] runs_matrix <- matrix(runs, nrow = 20000, ncol = 600) # Benchmark dm_rle <- RleArray(Rle(runs), dim = c(20000, 600)) class(seed(dm_rle)) dm_rle microbenchmark( block_processing = colSums2(dm_rle, force_block_processing = TRUE), seed_aware = colSums2(dm_rle), times = 10) total(profmem(colSums2(dm_rle, force_block_processing = TRUE))) total(profmem(colSums2(dm_rle)))
The development of 'seed-aware' methods is ongoing work (see the Roadmap), and for now only a few methods and seed-types have a 'seed-aware' method.
An extensive set of benchmarks is under development at http://peterhickey.org/BenchmarkingDelayedMatrixStats/.
A key feature of a DelayedArray is the ability to register 'delayed
operations'. For example, let's compute sin(dm_matrix)
:
system.time(sin_dm_matrix <- sin(dm_matrix))
This instantaneous because the operation is not actually performed, rather
it is registered and only performed when the object is realized. All methods
in r Biocpkg("DelayedMatrixStats")
will correctly realise these delayed
operations before computing the final result. For example, let's compute
colSums2(sin_dm_matrix)
and compare check we get the correct answer:
all.equal(colSums2(sin_dm_matrix), colSums(sin(as.matrix(dm_matrix))))
The initial version of r Biocpkg("DelayedMatrixStats")
provides complete
coverage of the r CRANpkg("matrixStats")
column-wise and row-wise API[^api],
allowing package developers to use these functions with DelayedMatrix objects
as well as with ordinary matrix objects. This should simplify package
development and assist authors to support to their software for large datasets
stored in disk-backed data structures such as HDF5Array. Such large datasets
are increasingly common with the rise of single-cell genomics.
Future releases of r Biocpkg("DelayedMatrixStats")
will improve the
performance of these methods, specifically by developing additional 'seed-aware'
methods. The plan is to prioritise commonly used methods (e.g.,
colMeans2()
/rowMeans2()
, colSums2()
/rowSums2()
, etc.) and the
development of 'seed-aware' methods for the HDF5Matrix class. To do so, we
will leverage the r Biocpkg("beachmat")
package. Proof-of-concept code
has shown that this can greatly increase the performance when analysing such
disk-backed data.
Importantly, all package developers using methods from
r Biocpkg("DelayedMatrixStats")
will immediately gain from performance
improvements to these low-level routines. By using
r Biocpkg("DelayedMatrixStats")
, package developers will be able to focus on
higher level programming tasks and address important scientific questions and
technological challenges in high-throughput biology.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.