knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
You can install the release version of r BiocStyle::Biocpkg("sparseMatrixStats")
from BioConductor:
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("sparseMatrixStats")
The sparseMatrixStats package implements a number of summary functions for sparse matrices from the r BiocStyle::CRANpkg("Matrix")
package.
Let us load the package and create a synthetic sparse matrix.
library(sparseMatrixStats) # Matrix defines the sparse Matrix class # dgCMatrix that we will use library(Matrix) # For reproducibility set.seed(1)
Create a synthetic table with customers, items, and how often they bought that item.
customer_ids <- seq_len(100) item_ids <- seq_len(30) n_transactions <- 1000 customer <- sample(customer_ids, size = n_transactions, replace = TRUE, prob = runif(100)) item <- sample(item_ids, size = n_transactions, replace = TRUE, prob = runif(30)) tmp <- table(paste0(customer, "-", item)) tmp2 <- strsplit(names(tmp), "-") purchase_table <- data.frame( customer = as.numeric(sapply(tmp2, function(x) x[1])), item = as.numeric(sapply(tmp2, function(x) x[2])), n = as.numeric(tmp) ) head(purchase_table, n = 10)
Let us turn the table into a matrix to simplify the analysis:
purchase_matrix <- sparseMatrix(purchase_table$customer, purchase_table$item, x = purchase_table$n, dims = c(100, 30), dimnames = list(customer = paste0("Customer_", customer_ids), item = paste0("Item_", item_ids))) purchase_matrix[1:10, 1:15]
We can see that some customers did not buy anything, where as some bought a lot.
sparseMatrixStats
can help us to identify interesting patterns in this data:
# How often was each item bough in total? colSums2(purchase_matrix) # What is the range of number of items each # customer bought? head(rowRanges(purchase_matrix)) # What is the variance in the number of items # each customer bought? head(rowVars(purchase_matrix)) # How many items did a customer not buy at all, one time, 2 times, # or exactly 4 times? head(rowTabulates(purchase_matrix, values = c(0, 1, 2, 4)))
In the previous section, I demonstrated how to create a sparse matrix from scratch using the sparseMatrix()
function.
However, often you already have an existing matrix and want to convert it to a sparse representation.
mat <- matrix(0, nrow=10, ncol=6) mat[sample(seq_len(60), 4)] <- 1:4 # Convert dense matrix to sparse matrix sparse_mat <- as(mat, "dgCMatrix") sparse_mat
The sparseMatrixStats package is a derivative of the r BiocStyle::CRANpkg("matrixStats")
package and implements it's API for
sparse matrices. For example, to calculate the variance for each column of mat
you can do
apply(mat, 2, var)
However, this is quite inefficient and matrixStats provides the direct function
matrixStats::colVars(mat)
Now for sparse matrices, you can also just call
sparseMatrixStats::colVars(sparse_mat)
If you have a large matrix with many exact zeros, working on the sparse representation can considerably speed up the computations.
I generate a dataset with 10,000 rows and 50 columns that is 99% empty
big_mat <- matrix(0, nrow=1e4, ncol=50) big_mat[sample(seq_len(1e4 * 50), 5000)] <- rnorm(5000) # Convert dense matrix to sparse matrix big_sparse_mat <- as(big_mat, "dgCMatrix")
I use the bench package to benchmark the performance difference:
bench::mark( sparseMatrixStats=sparseMatrixStats::colVars(big_sparse_mat), matrixStats=matrixStats::colVars(big_mat), apply=apply(big_mat, 2, var) )
As you can see sparseMatrixStats
is ca. 50 times fast than matrixStats
, which in turn is 7 times faster than the apply()
version.
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.