knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "README-" )
fstArray provides an on-disk backend for a
DelayedArray object using
the fst file format. fstArray
provides a convenient way to store matrix-like data in an .fst
file and
operate on it using delayed operations and block processing.
The fst
file format is designed for serializing data frames. Therefore,
fstArray can only handle 2-dimensional arrays (matrices), which it does by
(ab)using the fst
file format to store columns of the matrix as columns of
the serialized data frame. Consequently, fstArray is best used for tasks
where data are processed column-wise (or by all columns at once) and, ideally,
by processing contiguous rows.
HDF5Array is the obvious
alternative to fstArray. The HDF5
file format provides arrays of
arbitrary dimension and allows alternative chunking and serialization
strategies to better support arbitrary data access patterns.
In initial testing, I find that fstArray is generally faster at reading data from disk, surprisingly often even when reading data using sub-optimal access patterns.
Limitations: Currently, there is no equivalent to
HDF5Array::writeHDF5Array()
, i.e. writefstArray()
. This will be added in a
future version of fstArray once the core fst library supports
appending data directly on/to disk.
For now, an fstArray can be created by wrapping a fst
file created by
fst::write_fst()
.
You can install fstArray from github with:
# install.packages("devtools") devtools::install_github("PeteHaitch/fstarray/")
devtools::load_all() library(fst) library(HDF5Array) library(microbenchmark)
library(fstArray) library(fst) library(HDF5Array) library(microbenchmark)
Here's a simple example comparing the time to read subsets of the data back into memory using fstArray and HDF5Array. fstArray and HDF5Array are compared without and with data compression[^compression]. To make the results more comparable, we chunk the HDF5Array by column when using compression, mimicing the storage scheme of the fstArray.
[^compression]: The level of compression is the default used by HDF5Array, scaled for use with fstArray.
The data are a 'long' matrix with 1 million rows and 10 columns filled with pseudorandom numbers between 0 and 1.
# Simulate a 'long' matrix with numeric data nrow <- 1e6 ncol <- 10 x <- matrix(runif(nrow * ncol), ncol = ncol, dimnames = list(NULL, letters[seq_len(ncol)]))
fstArray is slightly-to-noticeably faster than HDF5Array when reading contiguous rows of data into memory as an ordinary matrix, with the performance gap increasing when data compression is used. This is the best-case data access pattern for a fstArray.
As expected, fstArray is slower than HDF5Array when reading random rows of uncompressed data into memory as an ordinary matrix. Surprisingly, however, fstArray is faster than HDF5Array on this test when data compression is used.
Finally, the worst-case data access pattern for fstArray is reading random elements of data into memory as an ordinary vector. Surprisingly, fstArray is 2x faster than HDF5Array on this benchmark.
TODO: Revisit text after re-running benchmarking
We now compare the performance of loading the data into memory as an ordinary matrix. We first create fstArray and HDF5Array representations of the data to be used in this benchmarking:
# Construct fstArray and HDF5Array instances # Compression levels # NOTE: Amount of HDF5 compression (`level`) is on a scale of {0, 1, ..., 9} hdf5_compression_level <- getHDF5DumpCompressionLevel() hdf5_compression_level # NOTE: Amount of fst compression (`compress`) is on a scale of {0, 1, ..., 100} fst_compression_level <- as.integer(hdf5_compression_level / 9 * 101) # NOTE: Each fst file can only contain one dataset whereas a HDF5 file can # contain multiple datasets fst_no_compression <- tempfile(fileext = ".fst") fst::write_fst(x = as.data.frame(x), path = fst_no_compression, compress = 0) fstarray_no_compression <- fstArray(fst_no_compression) fst_with_compression <- tempfile(fileext = ".fst") fst::write_fst(x = as.data.frame(x), path = fst_with_compression, compress = fst_compression_level) fstarray_with_compression <- fstArray(fst_with_compression) hdf5array_no_compression <- writeHDF5Array(x = x, chunk_dim = c(nrow(x), 1L), level = 0) hdf5array_with_compression <- writeHDF5Array(x = x, chunk_dim = c(nrow(x), 1L), level = hdf5_compression_level)
All objects contain the same data[^dimnames]:
[^dimnames]: Currently, a HDF5Array cannot store the dimnames of the array whereas an fstArray can stored column names but not rownames.
all.equal(as.matrix(fstarray_no_compression), as.matrix(fstarray_with_compression)) all.equal(as.matrix(fstarray_no_compression), as.matrix(hdf5array_no_compression), check.attributes = FALSE) all.equal(as.matrix(fstarray_no_compression), as.matrix(hdf5array_with_compression), check.attributes = FALSE)
Similar file sizes are achieved for fstArray and HDF5Array instances:
file_sizes_df <- data.frame( name = c("fstarray_no_compression", "fstarray_with_compression", "hdf5array_no_compression", "hdf5array_with_compression"), class = c("_fstArray_", "_fstArray_", "_HDF5Array_", "_HDF5Array_"), compression = c(FALSE, TRUE, FALSE, TRUE), `size (bytes)` = c(file.size(seed(fstarray_no_compression)@file), file.size(seed(fstarray_with_compression)@file), file.size(seed(hdf5array_no_compression)@file), file.size(seed(hdf5array_with_compression)@file)), check.names = FALSE) knitr::kable(file_sizes_df)
It is faster to load an entire fstArray than an entire HDF5Array. This is despite fstArray currently having to do a coercion from a data frame to a matrix when loading data into memory[^coercion].
# Benchmark loading all data from disk as an ordinary matrix # TODO: Make a plot of result microbenchmark( fstarray_no_compression = as.matrix(fstarray_no_compression), fstarray_with_compression = as.matrix(fstarray_with_compression), hdf5array_no_compression = as.matrix(hdf5array_no_compression), hdf5array_with_compression = as.matrix(hdf5array_with_compression), times = 10)
Now, loading 10,000 contiguous rows from the middle of the matrix. Again, using an fstArray is a faster than an HDF5Array.
rows <- 500001:600000 # TODO: Make a plot of result (either ggplot2::autoplot() or what's done in fst # README) microbenchmark( fstarray_no_compression = as.matrix(fstarray_no_compression[rows, ]), fstarray_with_compression = as.matrix(fstarray_with_compression[rows, ]), hdf5array_no_compression = as.matrix(hdf5array_no_compression[rows, ]), hdf5array_with_compression = as.matrix(hdf5array_with_compression[rows, ]), times = 10)
We load 10,000 randomly selected rows, the worst data access pattern for an fstArray. Unsurprisingly, an fstArray is slower than HDF5Array, so algorithms should be designed accordingly when targetting fstArray.
rows <- sample(nrow(x), 10000) microbenchmark( fstarray_no_compression = as.matrix(fstarray_no_compression[rows, ]), fstarray_with_compression = as.matrix(fstarray_with_compression[rows, ]), hdf5array_no_compression = as.matrix(hdf5array_no_compression[rows, ]), hdf5array_with_compression = as.matrix(hdf5array_with_compression[rows, ]), times = 10)
Benchmark loading 100,000 random elements of data from disk as an ordinary vector
TODO: How does [,DelayedArray-method
work?
elements <- sample(length(x), 100000) microbenchmark( fstarray_no_compression = fstarray_no_compression[elements], fstarray_with_compression = fstarray_with_compression[elements], hdf5array_no_compression = hdf5array_no_compression[elements], hdf5array_with_compression = hdf5array_with_compression[elements], times = 10)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.