stream-stats: Streaming Summary Statistics
In matter: A framework for rapid prototyping with file-based data structures


These functions allow calculation of streaming statistics. They are useful, for example, for calculating summary statistics on small chunks of a larger dataset, and then combining them to calculate the summary statistics for the whole dataset.

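This is not particularly interesting for statistics like sum(), where the partial results can be combined by simply applying the same function again. It is useful for statistics like var() or sd(), where combining the results from pieces of a larger dataset also requires the number of observations and the mean of each piece.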

# calculate streaming univariate statistics
s_range(x, ..., na.rm = FALSE)

s_min(x, ..., na.rm = FALSE)

s_max(x, ..., na.rm = FALSE)

s_prod(x, ..., na.rm = FALSE)

s_sum(x, ..., na.rm = FALSE)

s_mean(x, ..., na.rm = FALSE)

s_var(x, ..., na.rm = FALSE)

s_sd(x, ..., na.rm = FALSE)

s_any(x, ..., na.rm = FALSE)

s_all(x, ..., na.rm = FALSE)

s_nnzero(x, ..., na.rm = FALSE)

# calculate streaming matrix statistics
colstreamStats(x, stat, na.rm = FALSE, ...)

rowstreamStats(x, stat, na.rm = FALSE, ...)

# calculate combined summary statistics
stat_c(x, y, ...)

`x, y, ...`	Object(s) on which to calculate a summary statistic, or a summary statistic to combine.
`stat`	The name of a summary statistic to compute over the rows or columns of a matrix. Allowable values include: "range", "min", "max", "prod", "sum", "mean", "var", "sd", "any", "all", and "nnzero".
`na.rm`	If `TRUE`, remove `NA` values before summarizing.

These summary statistic functions are intended to be applied to chunks of a larger dataset. The results can then be combined, either with the individual summary statistic functions or with stat_c(), to produce the summary statistic for the full dataset. This is most useful for calculating variances and standard deviations iteratively when the full dataset is difficult or impossible to process in a single call, for example because it does not fit in memory.
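The chunk-and-combine workflow described above can be sketched as follows. This is a minimal illustration, not a prescribed pattern: it assumes that stat_c() combines two streaming statistics pairwise, as in the Examples, so that base R's Reduce() can fold a list of per-chunk results. The chunk size and the use of split() are arbitrary choices made for the illustration.

```r
library(matter)

set.seed(1)
x <- rnorm(1000)

# split the data into 10 chunks of 100 values each
chunks <- split(x, rep(1:10, each = 100))

# summarize each chunk independently, then fold the results together
parts <- lapply(chunks, s_var)
combined <- Reduce(stat_c, parts)

var(x)
combined # expected to match var(x) up to floating-point error
```

In practice each chunk would be read from disk or produced by a stream, so that only one chunk is held in memory at a time.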

The variances and standard deviations are calculated using running sum-of-squares formulas, which can be updated iteratively and remain accurate for large floating-point datasets (see References).
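To make the idea of a running sum-of-squares update concrete, here is a base R sketch of the standard pairwise formula for merging two chunks. It is not taken from the package source and may differ from what matter does internally. The helper names chunk_summary(), combine_summary(), and summary_var() are hypothetical and exist only for this illustration. Each chunk is reduced to its count n, its mean, and M2, the sum of squared deviations from its mean.

```r
# Hypothetical helper: summarize one chunk as (n, mean, M2),
# where M2 is the sum of squared deviations from the chunk mean.
chunk_summary <- function(x) {
  n <- length(x)
  m <- mean(x)
  list(n = n, mean = m, M2 = sum((x - m)^2))
}

# Hypothetical helper: merge two summaries without revisiting the data.
combine_summary <- function(a, b) {
  n <- a$n + b$n
  delta <- b$mean - a$mean
  list(
    n = n,
    mean = a$mean + delta * b$n / n,
    M2 = a$M2 + b$M2 + delta^2 * a$n * b$n / n
  )
}

# Sample variance from a summary (needs at least 2 observations).
summary_var <- function(s) s$M2 / (s$n - 1)

set.seed(1)
x <- rnorm(50)
y <- rnorm(30, mean = 5)

sxy <- combine_summary(chunk_summary(x), chunk_summary(y))

var(c(x, y))
summary_var(sxy) # agrees up to floating-point error

# a single new observation is the special case of a chunk with n = 1
sxy1 <- combine_summary(sxy, chunk_summary(99))
var(c(x, y, 99))
summary_var(sxy1)
```

Because the update works with deviations from the mean rather than raw sums of squares, it avoids the cancellation error of the textbook formula that subtracts the squared mean from the mean of squares.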

For all univariate functions except s_range(), a single number giving the summary statistic. For s_range(), two numbers giving the minimum and the maximum values.

For colstreamStats() and rowstreamStats(), a vector of summary statistics.

Kylie A. Bemis

B. P. Welford, “Note on a Method for Calculating Corrected Sums of Squares and Products,” Technometrics, vol. 4, no. 3, pp. 419-420, Aug. 1962.

B. O'Neill, “Some Useful Moment Results in Sampling Problems,” The American Statistician, vol. 68, no. 4, pp. 282-296, Sep. 2014.

See also: Summary

library(matter)

# combine streaming variances from two vectors
set.seed(1)
x <- sample(1:100, size=10)
y <- sample(1:100, size=10)

sx <- s_var(x)
sy <- s_var(y)

var(c(x, y))
stat_c(sx, sy) # should be the same

sxy <- stat_c(sx, sy)

# calculate with 1 new observation
var(c(x, y, 99))
stat_c(sxy, 99)

# calculate over rows of a matrix
set.seed(2)
A <- matrix(rnorm(100), nrow=10)
B <- matrix(rnorm(100), nrow=10)

sx <- rowstreamStats(A, "var")
sy <- rowstreamStats(B, "var")

apply(cbind(A, B), 1, var)
stat_c(sx, sy) # should be the same