knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE)
library(BiocStyle)
set.seed(0)

Overview

The BumpyMatrix class is a two-dimensional object where each entry contains a non-scalar object of constant type/class but variable length. This can be considered to be raggedness in the third dimension, i.e., "bumpiness". The BumpyMatrix is intended to represent complex data that has zero-to-many mappings between individual data points and each feature/sample, allowing us to store it in Bioconductor's standard 2-dimensional containers such as the SummarizedExperiment. One example could be to store transcript coordinates for highly multiplexed FISH data; the dimensions of the BumpyMatrix can represent genes and cells while each entry is a data frame with the relevant x/y coordinates.

Construction

A variety of BumpyMatrix subclasses are implemented but the most interesting is probably the BumpyDataFrameMatrix. This is an S4 matrix class where each entry is a DataFrame object, i.e., Bioconductor's wrapper around the data.frame. To demonstrate, let's mock up some data for our hypothetical FISH experiment:

library(S4Vectors)
df <- DataFrame(
    x=rnorm(10000), y=rnorm(10000), 
    gene=paste0("GENE_", sample(100, 10000, replace=TRUE)),
    cell=paste0("CELL_", sample(20, 10000, replace=TRUE))
)
df 

We then use the splitAsBumpyMatrix() utility to easily create our BumpyDataFrameMatrix based on the variables on the x- and y-axes. Here, each row is a gene, each column is a cell, and each entry holds all coordinates for that gene/cell combination.

library(BumpyMatrix)
mat <- splitAsBumpyMatrix(df[,c("x", "y")], row=df$gene, column=df$cell)
mat
mat[1,1][[1]]

We can also set sparse=TRUE to use a more efficient sparse representation, which avoids explicit storage of empty DataFrames. This may be necessary for larger datasets as there is a limit of r .Machine$integer.max (non-empty) entries in each BumpyMatrix.

chosen <- df[1:100,]
smat <- splitAsBumpyMatrix(chosen[,c("x", "y")], row=chosen$gene, 
    column=chosen$cell, sparse=TRUE)
smat

Basic operations

The BumpyMatrix implements many of the standard matrix operations, e.g., nrow(), dimnames(), the combining methods and transposition.

dim(mat)
dimnames(mat)
rbind(mat, mat)
cbind(mat, mat)
t(mat)

Subsetting will yield a new BumpyMatrix object corresponding to the specified submatrix. If the returned submatrix has a dimension of length 1 and drop=TRUE, the underlying CompressedList of values (in this case, the list of DataFrames) is returned.

mat[c("GENE_2", "GENE_20"),]
mat[,1:5]
mat["GENE_10",]

For BumpyDataFrameMatrix objects, we have an additional third index that allows us to easily extract an individual column of each DataFrame into a new BumpyMatrix. In the example below, we extract the x-coordinate into a new BumpyNumericMatrix:

out.x <- mat[,,"x"]
out.x
out.x[,1]

Common arithmetic and logical operations are already implemented for BumpyNumericMatrix subclasses. Almost all of these operations will act on each entry of the input object (or corresponding entries, for multiple inputs) and produce a new BumpyMatrix of the appropriate type.

pos <- out.x > 0
pos[,1]
shift <- 10 * out.x + 1
shift[,1]
out.y <- mat[,,"y"]
greater <- out.x < out.y
greater[,1]
diff <- out.y - out.x
diff[,1]

Advanced subsetting

When subsetting a BumpyMatrix, we can use another BumpyMatrix containing indexing information for each entry. Consider the following code chunk:

i <- mat[,,'x'] > 0 & mat[,,'y'] > 0
i
i[,1]
sub <- mat[i]
sub
sub[,1]

Here, i is a BumpyLogicalMatrix where each entry is a logical vector. When we do x[i], we effectively loop over the corresponding entries of x and i, using the latter to subset the DataFrame in the former. This produces a new BumpyDataFrameMatrix containing, in this case, only the observations with positive x- and y-coordinates.

For BumpyDataFrameMatrix objects, subsetting to a single field in the third dimension will automatically drop to the type of the underlying column of the DataFrame. This can be stopped with drop=FALSE to preserve the BumpyDataFrameMatrix output:

mat[,,'x']
mat[,,'x',drop=FALSE]

In situations where we want to drop the third dimension but not the first two dimensions (or vice versa), we use the .dropk argument. Setting .dropk=FALSE will ensure that the third dimension is not dropped, as shown below:

mat[1,1,'x']
mat[1,1,'x',.dropk=FALSE]
mat[1,1,'x',drop=FALSE]
mat[1,1,'x',.dropk=TRUE,drop=FALSE]

Subset replacement is also supported, which is most useful for operations to modify specific fields:

copy <- mat
copy[,,'x'] <- copy[,,'x'] * 20
copy[,1]

Additional operations

Some additional statistical operations are also implemented that will usually produce an ordinary matrix. Here, each entry corresponds to the statistic computed from the corresponding entry of the BumpyMatrix.

mean(out.x)[1:5,1:5] # matrix
var(out.x)[1:5,1:5] # matrix

The exception is with operations that naturally produce a vector, in which case a matching 3-dimensional array is returned:

quantile(out.x)[1:5,1:5,]
range(out.x)[1:5,1:5,]

Other operations may return another BumpyMatrix if the output length is variable:

pmax(out.x, out.y) 

BumpyCharacterMatrix objects also have their own methods for grep(), tolower(), etc. to manipulate the strings in a convenient manner.

Session information {-}

sessionInfo()


LTLA/BumpyMatrix documentation built on Dec. 27, 2024, 9:28 a.m.