title: "Introduction to Silly Putty"
author: "Dwayne Tally, Zachary B. Abrams, and Kevin R. Coombes"
date: "r Sys.Date()
"
output:
rmarkdown::html_document:
theme: journal
highlight: kate
vignette: >
%\VignetteIndexEntry{Introduction to SillyPutty}
%\VignetteKeywords{SillyPutty,Clustering Algorithm,Clusters,Graphics}
%\VignetteDepends{SillyPutty}
%\VignettePackage{SillyPutty}
%\VignetteEngine{knitr::rmarkdown}
knitr::opts_chunk$set(fig.width=8, fig.height=8) options(width=96) .format <- knitr::opts_knit$get("rmarkdown.pandoc.to") .tag <- function(N, cap ) ifelse(.format == "html", paste("Figure", N, ":", cap), cap)
cat(' <style type="text/css"> .figure { text-align: center; } .caption { font-weight: bold; } </style> ')
In many diseases, such as cancer, it is important to have a clear understanding of what potential clinical subgroup an individual patient belongs to. Unsupervised clustering is a useful analytic tool to address this problem. A variety of clustering methods already exist and are differentiated by the kinds of outcome measures they are intended to optimize. For example, K-means is designed to minimize the within-cluster sum of square errors. Partitioning around medoids generalizes this idea from the Euclidean distance metric defined by sums of squares to an arbitrary distance metric. The different linkage rules used in hierarchical clustering methods also change the nature of the value being optimized.
Ever since Kaufmann and Rooseeuw introduced the idea of the silhouette width, researches have used its average value to select the best method when applying different clustering methods to the same data set. To out knowledge, no one has tried to use silhouette width as a quantity to be optimized directly when finding clusters. To test the idea that optimizing the silhouette width could be used to cluster elements, we developed a novel algorithm that we call "SillyPutty". In brief, after elements have been assigned to clusters, we can calculate the silhouette width (SW) of each element, yielding numbers between -1 and +1. A positive value of SW indicates that an element is likely to be properly clustered, while a negative value of SW indicates the element is probably not in the correct cluster. The repeated step in the SillyPutty algorithm is to reclassify the element with the most negative silhouette width by placing it into the cluster to which it is closest. This process can (usually) be repeated until there are no negative silhouette widths present in the data. (There is a small chance that this algorithm will fail to converge by entering a small infinite loop where the same elements are rearranged to get back to an earlier configuration.)
We must first load the necessary packages.
library(SillyPutty) library(Umpire) suppressMessages( library(Mercator) ) suppressMessages( library(mclust) ) # for adjusted rand index
We use the Umpire R package (version r packageVersion("Umpire")
) to generate more complex and realistic
synthetic data. We then compute the Euclidean distances between elements. Then, we use the Mercator
R package (version r packageVersion("Mercator")
) to visualize the data. Finally, we use the mclust R package
(version r packageVersion("mclust")
) to compute the Adjusted Rand index (ARI), a measure of cluster quality
that compares clusters to externally known truth.
The next chunk of code creates the objects that we will use to simulate a data set. We set things up to represent a kind of cancer with four subtypes corresponding to recurrent sets of "hits", where each hit can be thought of as an abstract "mutation" that affects the expression of a pathway of related genes.
set.seed(21315) trueK <- 4 ## Set up survival outcome; baseline is exponential sm <- SurvivalModel(baseHazard=1/5, accrual=5, followUp=1) ## Build a CancerModel with four subtypes nBlocks <- 20 cm <- CancerModel(name="cansim", nPossible=nBlocks, nPattern=trueK, OUT = function(n) rnorm(n, 0, 1), SURV= function(n) rnorm(n, 0, 1), survivalModel=sm) ## Include 100 blocks/pathways that are not hit by cancer nTotalBlocks <- nBlocks + 100 ## Assign values to hyperparameters ## block size blockSize <- round(rnorm(nTotalBlocks, 100, 30)) ## log normal mean hyperparameters mu0 <- 6 sigma0 <- 1.5 ## log normal sigma hyperparameters rate <- 28.11 shape <- 44.25 ## block correlation p <- 0.6 w <- 5 ## Set up the baseline Engine rho <- rbeta(nTotalBlocks, p*w, (1-p)*w) base <- lapply(1:nTotalBlocks, function(i) { bs <- blockSize[i] co <- matrix(rho[i], nrow=bs, ncol=bs) diag(co) <- 1 mu <- rnorm(bs, mu0, sigma0) sigma <- matrix(1/rgamma(bs, rate=rate, shape=shape), nrow=1) covo <- co *(t(sigma) %*% sigma) MVN(mu, covo) }) eng <- Engine(base) ## Alter the means if there is a hit altered <- alterMean(eng, normalOffset, delta=0, sigma=1) ## Build the CancerEngine using character strings object <- CancerEngine(cm, "eng", "altered") rm(sm, nBlocks, cm, nTotalBlocks, blockSize, mu0, sigma0, rate, shape, p, w, rho, base, eng, altered)
Now we can take a random sample of 144 elements from the distribution that we just defined.
trueN <- 144 dset <- rand(object, trueN, keepall = TRUE) # contains two objects labels <- dset$clinical$CancerSubType # the true clusters/types d1 <- dset$data # the noise-free simulated data
To make our data set even more realistic, we are going to add noise that mimics what happens in some biological assays.
SpecialNoise <- function(nFeat, nu = 0.1, shape = 1.02, scale = 0.05/shape) { NoiseModel(nu = nu, tau = rgamma(nFeat, shape = shape, scale = scale), phi = 0) } nm <- SpecialNoise(nrow(d1), nu = 0) d1 <- blur(nm, d1) dim(d1)
Now we compute the Euclidean distances between pairs of elements in our simulated data set.
tdis <- t(d1) dimnames(tdis) <- list(paste("Sample", 1:nrow(tdis), sep=''), paste("Feature", 1:ncol(tdis), sep='')) dis <- dist(tdis) ## This step is the rate-liomiting factor. Only way to speed up is to use fewerw samples names(labels) <- rownames(tdis)
dataset <- tdis eucdist <- dis trueGroups <- labels save(eucdist, trueGroups, file="../data/eucdist.rda")
As noted above, we will use the Mercator package for visualization. This function will ensure that we generate consistent sets of pictures.
mercViews <- function(object, main, tag = NULL) { opar <- par(mfrow = c(2, 2)) on.exit(par(opar)) pts <- barplot(object, main = main) if (!is.null(tag)) { gt <- as.vector(as.matrix(table(getClusters(object)))) loc <- pts[round((c(0, cumsum(gt))[-(1 + length(gt))] + cumsum(gt))/2)] mtext(tag, side =1, line = 0, at = loc, col = object@palette, font = 2) } plot(object, view = "tsne", main = "t-SNE") plot(object, view = "hclust") plot(object, view = "mds", main = "MDS") }
We will apply various clustering methods to the data (represented primarily through its distance matrix). We want to demonstrate with this example that SillyPutty clustering can do a better job than hierarchical clustering or PAM.
Figure 1 presents multiple views of the Euclidean distances between our simulated data. Since
we know that we started with r trueK
clusters, we chose that as the number to find using the default method
of hierarchical clustering with Ward's linkage rule. (We will later illustrate how to use SillyPutty
to find the number of clusters.)
The silhouette width plot in the upper left panel indicates that each of the clusters contains some poorly-classified elements, identified by their negative silhouette widths. Both the multidimensional scaling (MDS) plot in the lower right and the t-stochastic neighbor embedding (t-SNE) plot in the upper right clearly display colored points that appear to be in the wrong regions.
set.seed(1987) vis <- Mercator(dis, "euclid", "hclust", K = trueK) palette <- vis@palette[c(1:3, 7, 8, 6, 10, 4, 11, 5, 15, 14, 17:18, 9, 12, 16, 19:24)] vis@palette <- palette vis <- addVisualization(vis, "mds") vis <- addVisualization(vis, "tsne") mercViews(vis, "Hierarchical Clustering, Five Clusters")
The adjusted Rand index isn't very good, either.
ari.hier <- adjustedRandIndex(labels, vis@clusters) ari.hier
Since we know the truth, we can reassign the clusters inside the Mercator object to see what everything is supposed to look like (Figure 2). Notice that the silhouette width plot agrees that everything is in the right place, and that the MDS and t-SNE plots are also consistent.
truebin <- remapColors(vis, setClusters(vis, labels)) mercViews(truebin, main = "True Cluster Types", tag = unique(sort(labels)))
Here we apply PAM clustering to the same distance matrix (Figure 3). These results are clearly much worse than hierarchical clustering.
pc <- pam(dis, k = trueK, diss=TRUE) pamc <- remapColors(vis, setClusters(vis, pc$clustering)) mercViews(pamc, main = "PAM, K = 4", tag = paste("P", 1:trueK, sep = "")) ari.pam <- adjustedRandIndex(labels, pamc@clusters) ari.pam
RandomSillyPutty is the core of the SillyPutty package. It takes a distance matrix, the desired number of clusters K, and the number N of times you want to apply SillyPutty to the data set. Each time, you start with different random cluster assignments. You then apply the "move the worst element" algorithm described above. RandomSillyPutty saves the best and worst silhouette width scores along with their associated data clusters.
set.seed(12) y2 <- suppressWarnings(RandomSillyPutty(dis, trueK, N = 100)) ## this is also slow ari.max <- adjustedRandIndex(truebin@clusters, y2@MX) ari.min <- adjustedRandIndex(truebin@clusters, y2@MN) ari.max ari.min
The adjusted rand index of the best SillyPutty clustering is 0.98, meaning that we have almost completely recovered the true cluster assignments present in the data. Note that even the worst result that we obtained from SillyPutty has a better ARI (0.61) than PAM (0.43), though not quite as good as hierarchical clustering (0.71).
We can now update the Mercator object using the cluster assignments defined by the best SillyPutty result (Figure 4). The silhouette width plot now says that it thinks all elements are in good clusters, and both the MDS and t-SNE plots support that conclusion.
randSillyBin <- remapColors(vis, setClusters(vis, y2@MX)) mercViews(randSillyBin, main = "SillyPutty Cluster Types, K = 4", tag = paste("C", 1:trueK, sep = ""))
We can also plot the cluster assignments that had the maximum and minimum silhouette widths from running the \code{RandokmSillyPutty} algorithm. We will use the multidimensional scaling layout and the color palette from the Mercator object.
plot(y2, randSillyBin@view[["mds"]], distobj = dis, col = randSillyBin@palette) summary(y2)
Since SillyPutty can start with any existing cluster assignments (even random ones, as we just saw), we can combine it with any other method. here we are going to start with the results of hierarchcial clustering, and just take one pass of SillyPutty to "improve" its results. To apply SillyPutty to an already precomputed clustering algorithm, you have to have the cluster identities of the clustering algorithm and the distance matrix of the data set. SillyPutty will then recalculate the clusters from a starting point within the post-clustered clusters and return the best silhouette width score and the new cluster identities.
hierSilly <- SillyPutty(vis@clusters, dis) hierSillyBin <- remapColors(vis, setClusters(vis, hierSilly@cluster)) mercViews(hierSillyBin, main = "HClust + Silly, k = 4",tag = paste("C", 1:trueK, sep = "")) ari.Sillyhier <- adjustedRandIndex(labels, hierSillyBin@clusters) ari.Sillyhier
RangeSillyPutty uses RandomSillyPutty to determine the best mean silhouette width, for a range of clusters values. Then you can use the best silhouette widths to apporximate the actual number of clusters within the dataset.
Figure 6 shows the best silhouette width achieved with each possible number of clusters. The best overall value occurs when K = 4, which is the true number of clusters.
y <- findClusterNumber(dis, start = 2, end = 12, method = "HCSP") plot(names(y), y, xlab = "K", ylab = "Silhouette Width", type = "b", lwd = 2, pch = 16)
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.