chunkApply | R Documentation |
Perform equivalents of apply, lapply, and mapply, but over parallelized chunks of data. This is most useful if accessing the data is potentially time-consuming, such as for file-based matter objects. Operating on chunks reduces the number of I/O operations.
## Operate on elements/rows/columns
chunkApply(X, MARGIN, FUN, ...,
    simplify = FALSE, outpath = NULL,
    verbose = NA, BPPARAM = bpparam())
chunkLapply(X, FUN, ...,
    simplify = FALSE, outpath = NULL,
    verbose = NA, BPPARAM = bpparam())
chunkMapply(FUN, ...,
    simplify = FALSE, outpath = NULL,
    verbose = NA, BPPARAM = bpparam())
## Operate on complete chunks
chunk_rowapply(X, FUN, ...,
    simplify = "c", depends = NULL, permute = FALSE,
    RNG = FALSE, verbose = NA, chunkopts = list(),
    BPPARAM = bpparam())
chunk_colapply(X, FUN, ...,
    simplify = "c", depends = NULL, permute = FALSE,
    RNG = FALSE, verbose = NA, chunkopts = list(),
    BPPARAM = bpparam())
chunk_lapply(X, FUN, ...,
    simplify = "c", depends = NULL, permute = FALSE,
    RNG = FALSE, verbose = NA, chunkopts = list(),
    BPPARAM = bpparam())
chunk_mapply(FUN, ..., MoreArgs = NULL,
    simplify = "c", depends = NULL, permute = FALSE,
    RNG = FALSE, verbose = NA, chunkopts = list(),
    BPPARAM = bpparam())
X |
A matrix for |
MARGIN |
If the object is matrix-like, which dimension to iterate over. Must be 1 or 2, where 1 indicates rows and 2 indicates columns. The dimension names can also be used if |
FUN |
The function to be applied. |
MoreArgs |
A list of other arguments to |
... |
Additional arguments to be passed to |
simplify |
Should the result be simplified into a vector, matrix, or higher dimensional array? |
outpath |
If non-NULL, a file path where the results should be written as they are processed. If specified, |
verbose |
Should user messages be printed with the current chunk being processed? If |
chunkopts |
An (optional) list of chunk options including |
depends |
A list with length equal to the extent of |
permute |
Should the order of items be randomized? This may be useful for iterating over random subsets. No attempt is made to re-order the results. |
RNG |
Should the local random seed (as set by |
BPPARAM |
An optional instance of |
For chunkApply(), chunkLapply(), and chunkMapply():
For vectors and lists, the vector is broken into some number of chunks according to the chunk options (see chunkopts). The individual elements of each chunk are then passed to FUN.
For matrices, the matrix is chunked along rows or columns, based on the number of chunks. The individual rows or columns of each chunk are then passed to FUN.
In this way, FUN receives the same kind of first argument that it would receive from the base apply, lapply, and mapply functions.
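The element-wise behavior described above can be illustrated in base R. This is only a sketch of the idea, not matter's implementation: the helper name chunk_apply_rows_sketch and its nchunks argument are hypothetical, and the loop runs serially rather than through BiocParallel.

```r
# Hypothetical helper (not part of matter): apply FUN to each row,
# reading the rows one chunk at a time.
chunk_apply_rows_sketch <- function(X, FUN, nchunks = 4L, ...) {
  idx <- seq_len(nrow(X))
  # Assign each row index to one of `nchunks` contiguous groups
  groups <- split(idx, ceiling(idx * nchunks / length(idx)))
  out <- vector("list", nrow(X))
  for (rows in groups) {
    chunk <- X[rows, , drop = FALSE]  # one subsetting step per chunk, not per row
    for (k in seq_along(rows)) {
      out[[rows[k]]] <- FUN(chunk[k, ], ...)  # FUN still sees a single row
    }
  }
  out
}

x <- matrix(1:20, nrow = 5)
res <- chunk_apply_rows_sketch(x, sum, nchunks = 2L)
# Same values as the base equivalent
identical(unlist(res), apply(x, 1, sum))
```

The point of the design is that FUN is written exactly as it would be for apply, while the data access is grouped by chunk.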
For chunk_rowapply(), chunk_colapply(), chunk_lapply(), and chunk_mapply():
In this situation, the entire chunk is passed to FUN, and FUN is responsible for knowing how to handle a sub-vector or sub-matrix of the original object. This may be useful if FUN is already a function that could be applied to the whole object such as rowSums or colSums.
When this is the case, it may be useful to provide a custom simplify function.
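As a rough base R illustration of the whole-chunk style and of why a custom combiner can matter, consider the sketch below. The helper chunk_rowapply_sketch is hypothetical and is not matter's code; in particular, it takes simplify as a function, which is an assumption made for the sketch.

```r
# Hypothetical helper (not part of matter): pass whole row-chunks to FUN
# and combine the per-chunk results with `simplify`.
chunk_rowapply_sketch <- function(X, FUN, nchunks = 4L, simplify = c, ...) {
  idx <- seq_len(nrow(X))
  groups <- split(idx, ceiling(idx * nchunks / length(idx)))
  pieces <- lapply(groups, function(rows) FUN(X[rows, , drop = FALSE], ...))
  do.call(simplify, unname(pieces))
}

x <- matrix(as.numeric(1:20), nrow = 5)

# rowSums already handles a sub-matrix, so concatenating the pieces is enough
rs <- chunk_rowapply_sketch(x, rowSums, nchunks = 2L)

# colSums over row-chunks yields partial sums, so the pieces must be added
cs <- chunk_rowapply_sketch(x, colSums, nchunks = 2L,
                            simplify = function(...) Reduce(`+`, list(...)))

all.equal(rs, rowSums(x))
all.equal(cs, colSums(x))
```

Concatenation is the natural combiner when each chunk contributes its own rows to the result; a reduction is needed when every chunk contributes to the same output positions.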
For convenience to the programmer, several attributes are made available when operating on a chunk.
"chunkid": The index of the chunk currently being processed by FUN.
"chunklen": The number of elements in the chunk that should be processed.
"index": The indices of the elements of the chunk, as elements/rows/columns in the original matrix/vector.
"depends" (optional): If depends is given, then this is a list of indices within the chunk. The length of the list is equal to the number of elements/rows/columns in the chunk. Each list element is either NULL or a vector of indices giving the elements/rows/columns of the chunk that should be processed for that index. The indices that should be processed will be non-NULL, and indices that should be ignored will be NULL.
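To show how a function might read these attributes, here is a small base R mock-up. It builds the chunks by hand instead of calling matter, so make_chunk_sketch and label_max are hypothetical names; only the attribute names "chunkid", "chunklen", and "index" come from the list above.

```r
# Hypothetical mock-up (not matter's code): attach chunk metadata as attributes.
make_chunk_sketch <- function(x, index, chunkid) {
  chunk <- x[index]
  attr(chunk, "chunkid") <- chunkid
  attr(chunk, "chunklen") <- length(index)
  attr(chunk, "index") <- index
  chunk
}

# A FUN that uses the metadata to report positions in the original vector
label_max <- function(chunk) {
  pos <- attr(chunk, "index")[which.max(chunk)]
  c(chunkid = attr(chunk, "chunkid"), position = pos, value = max(chunk))
}

x <- c(3, 9, 2, 7, 8, 1)
chunks <- list(make_chunk_sketch(x, 1:3, 1L),
               make_chunk_sketch(x, 4:6, 2L))
res <- lapply(chunks, label_max)
```

Without the "index" attribute, the second chunk could only report that its maximum is its second element; with it, the result refers to position 5 of the original vector.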
The depends argument can be used to iterate over dependent elements of a vector, or dependent rows/columns of a matrix. This can be useful if the calculation for a particular row/column/element depends on the values of others.
When depends is provided, multiple rows/columns/elements will be passed to FUN. Each element of the depends list should be a vector giving the indices that should be passed to FUN.
For example, this can be used to implement a rolling apply function.
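A depends-style list for a rolling window might be built as in the sketch below. The helper rolling_depends_sketch and its halfwidth argument are hypothetical, and the final vapply call is a plain base R stand-in for the iteration, used here only to show what each list entry selects.

```r
# Hypothetical helper: entry i lists the indices in a window centered on i,
# truncated at the ends of the vector.
rolling_depends_sketch <- function(n, halfwidth = 1L) {
  lapply(seq_len(n), function(i) {
    seq(max(1L, i - halfwidth), min(n, i + halfwidth))
  })
}

x <- c(1, 4, 9, 16, 25)
deps <- rolling_depends_sketch(length(x), halfwidth = 1L)

# Base R stand-in: compute the mean over the elements named in each entry
roll_mean <- vapply(deps, function(i) mean(x[i]), numeric(1))
```

Each entry includes the index itself along with its neighbors, so the list has one entry per element of x and the result has the same length as x.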
Several options are supported by chunkopts to override the global options:
nchunks: The number of chunks to use. If omitted, this is taken from getOption("matter.default.nchunks"). For IO-bound operations, using fewer chunks will often be faster, but use more memory.
chunksize: The approximate chunk size in bytes. If omitted, this is taken from getOption("matter.default.chunksize"). For IO-bound operations, using larger chunks will often be faster, but use more memory. If set to NA_real_, then the chunk size is determined by the number of chunks.
serialize: Whether data in virtual memory should be realized on the manager and serialized to the workers (TRUE), passed to the workers in virtual memory as-is (FALSE), or if matter should decide the behavior based on the cluster configuration (NA). If omitted, this is taken from getOption("matter.default.serialize"). If all workers have access to the same virtual memory resources (whether file storage or shared memory), then it can be significantly faster to avoid serializing the data.
The return value is typically a list if simplify=FALSE. Otherwise, the results may be coerced to a vector or array.
Kylie A. Bemis
apply, lapply, mapply, RNGkind, RNGStreams, SnowfastParam
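library(matter)
# Use a serial backend so the example runs without a parallel cluster
register(SerialParam())
set.seed(1)
x <- matrix(rnorm(1000^2), nrow=1000, ncol=1000)
# Row means, computed over 10 chunks of rows
out <- chunkApply(x, 1L, mean, chunkopts=list(nchunks=10))
head(out)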