We begin by creating a function that allows us to easily run a pre-made test matrix against a variety of metrics to return runtime and error. Because we know hierarchical clustering to be the fastest plot to compute, we will use it as the confirmatory plot of our output.
library(BinaryMatrix)
set.seed(1987)
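mettest <- function(matrix, metric){
  # Time the distance computation only; plotting is excluded from the runtime.
  t1 <- Sys.time()
  #metchar <- cat("\"", metric , "\"" )
  met <- binaryDistance(matrix, metric)
  t2 <- Sys.time()
  # Confirm that the distances are usable by clustering and plotting them.
  plot(hclust(met))
  # format() keeps the time units, which cat() alone would drop.
  cat("Runtime for metric: ", format(t2 - t1))
}

my.mets <- c("jaccard", "sokalMichener", "hamming", "russellRao", "pearson",
             "goodmanKruskal", "manhattan", "canberra", "binary", "euclid")

# Send messages from failed try() calls to standard output.
options(try.outFile = stdout())

runmytests <- function(my.matrix){
  for(i in seq_along(my.mets)){
    cat("Test ", i, ": ", my.mets[i], "\n")
    try(mettest(my.matrix, my.mets[i]))
    cat("\n")
  }
}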
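Subtracting two `Sys.time()` values gives a `difftime` whose units are chosen automatically (seconds, minutes, and so on), so runtimes for fast and slow metrics are not always directly comparable. A minimal base-R sketch of one way to pin the units is shown here; `time_call` is a hypothetical helper for illustration and is not part of the BinaryMatrix package.

```r
# Hypothetical helper (not from the BinaryMatrix package): evaluate an
# expression and return the elapsed wall-clock time, always in seconds.
time_call <- function(expr) {
  t1 <- Sys.time()
  force(expr)                        # the lazy argument is evaluated here
  t2 <- Sys.time()
  as.numeric(difftime(t2, t1, units = "secs"))
}

elapsed <- time_call(Sys.sleep(0.1))
cat("Runtime:", round(elapsed, 3), "secs\n")
```

Returning a plain number in fixed units also makes it easy to collect the runtimes in a vector and compare metrics afterwards.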
Next, we explore acceptable input for the binaryDistance function. We begin by creating a well-behaved test matrix and BinaryMatrix of 500x500 dimensions with randomly generated 0s and 1s, with p = 0.5.
goodmat <- matrix(rbinom(500*500, 1, 0.5), nrow = 500)
goodf <- data.frame(1:500)
goodbm <- BinaryMatrix(goodmat, goodf)
We use our function to confirm that we can run every metric ("jaccard", "sokalMichener", "hamming", "russellRao", "pearson", "goodmanKruskal", "manhattan", "canberra", "binary", "euclid") on our basic test matrix.
runmytests(goodmat)
We find that none of the binaryDistance metrics take the simple BinaryMatrix object type as valid input.
runmytests(goodbm)
However, we find that all of the binaryDistance metrics will take BinaryMatrix@binmat (the matrix component).
runmytests(goodbm@binmat)
Since we accept that a binary matrix in the form of BinaryMatrix@binmat or a matrix will work, we will use a simple matrix for the remainder of these tests.
Next, we attempt to form a series of binary matrices to challenge the limits of the binaryDistance function.
We begin by varying the proportion of 0s and 1s in the set: p = 0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99.
The binaryDistance measures return error-free results at all levels. The hierarchical clustering visualization returns errors at p = 0.01 (pearson and canberra) and at p = 0.99 (pearson and goodmanKruskal).
There is evidence that some of the visualizations may behave strangely at the poles. We will need to test visualizations at the extremes: 0.01, 0.1, 0.5, 0.9, and 0.99.
ps <- c(0.01, 0.10, 0.30, 0.50, 0.70, 0.90, 0.99)
for(i in seq_along(ps)){
  # ps holds proportions, not percentages, so no "%" in the label.
  cat("p = ", ps[i], "\n")
  my.mat <- matrix(rbinom(500*500, 1, ps[i]), nrow = 500)
  runmytests(my.mat)
}
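One plausible explanation for the pearson failures at the extremes is constant columns: at p = 0.01 or p = 0.99, a column of 500 draws has a small but real chance of being all 0s or all 1s, and a correlation with a zero-variance vector is undefined. The sketch below illustrates this with base R only. It uses `cor()` as a stand-in for the pearson metric, which is an assumption; the internals of `binaryDistance` are not inspected here.

```r
# Chance that a column of 500 Bernoulli(0.01) draws is all zeros, and the
# expected number of such columns among 500 (same numbers apply to all-ones
# columns at p = 0.99).
p_const <- (1 - 0.01)^500
cat("P(constant column):", signif(p_const, 3), "\n")
cat("Expected constant columns out of 500:", signif(500 * p_const, 3), "\n")

# Tiny illustration: the third column is constant.
m <- cbind(c(0, 1, 0, 1, 1, 0),
           c(1, 1, 0, 0, 1, 0),
           rep(0, 6))
r <- suppressWarnings(cor(m))         # cor() warns that the standard deviation is zero
d <- as.dist(1 - r)                   # correlation-based dissimilarity (stand-in)
res <- try(hclust(d), silent = TRUE)  # hclust() does not accept NA distances

cat("Any NA in distances:", anyNA(d), "\n")
cat("hclust failed:", inherits(res, "try-error"), "\n")
```

If this is indeed the mechanism, the failure would depend on the random draw, so a different seed could pass or fail at the same p.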
Next, we vary the proportions of the matrix itself, continuing with a matrix the same size as we have been testing with. We create a long matrix (with very many columns) and a tall matrix (with very many rows).
A very tall matrix succeeds, but a very long matrix fails. Setting the number of rows to 0.002 times the total number of cells, which always yields 500 columns, succeeds with a variety of matrix sizes.
#For reference:
runmytests(goodmat)

verytall <- matrix(rbinom(500*500, 1, 0.5), ncol = 10)
runmytests(verytall)

verylong <- matrix(rbinom(500*500, 1, 0.5), nrow = 10)
try(runmytests(verylong))

lesslong <- matrix(rbinom(500*500, 1, 0.5), nrow = 50)
runmytests(lesslong)

differentlylong <- matrix(rbinom(100*100, 1, 0.5), nrow=10)
runmytests(differentlylong)

twoperlong <- matrix(rbinom(100*100, 1, 0.5), nrow = 0.002*100*100)
runmytests(twoperlong)

twoagainlong <- matrix(rbinom(1000*1000, 1, 0.5), nrow = 0.002*1000*1000)
runmytests(twoagainlong)
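Because `matrix()` infers the missing dimension from the length of the data, the shapes of these test matrices are easy to misread. The sketch below rebuilds each shape with placeholder zeros and prints the dimensions. The final lines assume that distances are computed between columns, which is an inference from the results in this document rather than something checked against the package source; under that assumption the number of pairwise distances grows quadratically with the column count.

```r
# Shapes only: placeholder zeros instead of random draws.
shapes <- list(
  verytall        = dim(matrix(integer(500*500),   ncol = 10)),
  verylong        = dim(matrix(integer(500*500),   nrow = 10)),
  lesslong        = dim(matrix(integer(500*500),   nrow = 50)),
  differentlylong = dim(matrix(integer(100*100),   nrow = 10)),
  twoperlong      = dim(matrix(integer(100*100),   nrow = 0.002*100*100)),
  twoagainlong    = dim(matrix(integer(1000*1000), nrow = 0.002*1000*1000))
)
for (nm in names(shapes)) {
  cat(nm, ":", shapes[[nm]][1], "rows x", shapes[[nm]][2], "columns\n")
}

# Assumption: one distance per pair of columns, stored as 8-byte doubles.
pairs_verylong <- choose(shapes$verylong[2], 2)
cat("Column pairs for verylong:", pairs_verylong, "\n")
cat("Approximate size in GB:", signif(pairs_verylong * 8 / 1e9, 3), "\n")
```

Seen this way, the column count rather than the aspect ratio may be what separates the cases that succeed from the one that fails.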
All the distance metrics tolerate a small number of duplicate rows or columns.
duprows <- rbind(goodmat[1:300, ], goodmat[200, ], goodmat[301:499, ])
runmytests(duprows)

dupmanyrows <- rbind(goodmat[1:300, ],
                     goodmat[200, ], goodmat[200, ],
                     goodmat[200, ], goodmat[200, ],
                     goodmat[200, ], goodmat[200, ],
                     goodmat[200, ], goodmat[200, ],
                     goodmat[200, ], goodmat[200, ],
                     goodmat[200, ], goodmat[200, ],
                     goodmat[200, ], goodmat[200, ],
                     goodmat[301:499, ])
runmytests(dupmanyrows)

dupcols <- cbind(goodmat[ , 1:300], goodmat[ , 200], goodmat[ , 301:499])
runmytests(dupcols)

dupmanycols <- cbind(goodmat[ , 1:300],
                     goodmat[ , 200], goodmat[ , 200],
                     goodmat[ , 200], goodmat[ , 200],
                     goodmat[ , 200], goodmat[ , 200],
                     goodmat[ , 200], goodmat[ , 200],
                     goodmat[ , 200], goodmat[ , 200],
                     goodmat[ , 200], goodmat[ , 200],
                     goodmat[ , 200], goodmat[ , 200],
                     goodmat[ , 301:499])
runmytests(dupmanycols)
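Repeated copies of one row or column can also be produced by indexing with a repeated position, which avoids spelling out each copy. The sketch below uses a small stand-in matrix `m` (not `goodmat`) and checks that the two constructions agree.

```r
m <- matrix(1:20, nrow = 5)   # small stand-in matrix, 5 rows x 4 columns

# Row 2 inserted twice after row 3, written out copy by copy ...
dup_rows_rbind <- rbind(m[1:3, ], m[2, ], m[2, ], m[4:5, ])
# ... and the same result via a repeated row index.
dup_rows_index <- m[c(1:3, rep(2, 2), 4:5), ]

# The same idea for columns.
dup_cols_cbind <- cbind(m[ , 1:2], m[ , 2], m[ , 2], m[ , 3:4])
dup_cols_index <- m[ , c(1:2, rep(2, 2), 3:4)]

cat("Rows agree:", identical(dup_rows_rbind, dup_rows_index), "\n")
cat("Columns agree:", identical(dup_cols_cbind, dup_cols_index), "\n")
```

With the index form, the number of duplicates becomes a single argument to `rep()`, which makes it straightforward to test larger numbers of duplicates.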
All metrics could handle rows of all 0s or 1s. All metrics other than Pearson can tolerate columns of all 0s or 1s.
all0 <- rep(0, 500)
all1 <- rep(1, 500)

row0 <- rbind(goodmat[1:300, ], all0, goodmat[301:499, ])
runmytests(row0)

row1 <- rbind(goodmat[1:300, ], all1, goodmat[301:499, ])
runmytests(row1)

col0 <- cbind(goodmat[ , 1:300], all0, goodmat[ , 301:499])
runmytests(col0)

col1 <- cbind(goodmat[ , 1:300], all1, goodmat[ , 301:499])
runmytests(col1)
A matrix (including the BinaryMatrix object) accepts multiple classes of data. The tests above confirm that all 10 distance metrics work with a matrix of integers. Below we test matrices of numeric, logical, and character class containing values of 0 and 1.
All distance metrics work for binary numeric and logical matrices.
Only the Manhattan, Canberra, Binary, and Euclidean distances return ostensibly meaningful output for a binary matrix of character class.
num.mat <- matrix(as.numeric(rbinom(500*500, 1, 0.5)), nrow = 500)
runmytests(num.mat)

log.mat <- matrix(as.logical(rbinom(500*500, 1, 0.5)), nrow = 500)
runmytests(log.mat)

char.matt <- matrix(as.character(rbinom(500*500, 1, 0.5)), nrow = 500)
runmytests(char.matt)
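The four classes differ only in how R stores the same 0/1 values. The sketch below, on a small example in base R, shows the storage type of each variant and one way to convert a character matrix back to integers before computing distances. The conversion step is a suggestion and not part of the tests above.

```r
set.seed(1987)
x <- rbinom(6, 1, 0.5)                 # rbinom() returns integers

int.m  <- matrix(x, nrow = 2)
num.m  <- matrix(as.numeric(x), nrow = 2)
log.m  <- matrix(as.logical(x), nrow = 2)
char.m <- matrix(as.character(x), nrow = 2)

types <- c(integer   = typeof(int.m),
           numeric   = typeof(num.m),
           logical   = typeof(log.m),
           character = typeof(char.m))
print(types)

# Convert "0"/"1" strings back to integers while keeping the dimensions.
fixed.m <- char.m
storage.mode(fixed.m) <- "integer"
cat("Round trip recovers the integer matrix:", identical(fixed.m, int.m), "\n")
```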
How do the various distance metrics fare in the face of non-binary data?
We begin with the common case of "binary" data stored as 1s and 2s.
All distance metrics return output without error, but some hierarchical clustering patterns are unusual.
two.mat <- goodmat + 1
runmytests(two.mat)
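If data coded as 1s and 2s are meant to be treated as binary, one option is to recode them to 0s and 1s before computing distances. The sketch below is a suggestion rather than part of the original tests; it uses a small stand-in for `goodmat` and base R only.

```r
set.seed(1987)
small.mat <- matrix(rbinom(25, 1, 0.5), nrow = 5)   # stand-in for goodmat
two.small <- small.mat + 1                          # values are now 1 and 2

# Recode: subtract 1 so that 1 -> 0 and 2 -> 1.
recoded <- two.small - 1

cat("Only 0s and 1s after recoding:", all(recoded %in% c(0, 1)), "\n")
cat("Matches the original values:", all(recoded == small.mat), "\n")
```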
We take the case of a randomly generated set of probabilities between 0 and 1.
All distance metrics return output without error, but some hierarchical clustering patterns are unusual.
prob.mat <- matrix(runif(500*500, min = 0, max = 1), nrow = 500)
runmytests(prob.mat)
We take the final case of a randomly generated set of numbers with a wide range.
All distance metrics return output without error, but some hierarchical clustering patterns are highly unusual.
wide.mat <- matrix(rnorm(500*500, sd = 1000), nrow = 500)
runmytests(wide.mat)