knitr::opts_chunk$set(error=FALSE, message=FALSE, warnings=FALSE) library(BiocStyle)
The r Biocpkg("bluster")
package provides a flexible and extensible framework for clustering in Bioconductor packages/workflows.
At its core is the clusterRows()
generic that controls dispatch to different clustering algorithms.
We will demonstrate on some single-cell RNA sequencing data from the r Biocpkg("scRNAseq")
package;
our aim is to cluster cells into cell populations based on their PC coordinates.
library(scRNAseq) sce <- ZeiselBrainData() # Trusting the authors' quality control, and going straight to normalization. library(scuttle) sce <- logNormCounts(sce) # Feature selection based on highly variable genes. library(scran) dec <- modelGeneVar(sce) hvgs <- getTopHVGs(dec, n=1000) # Dimensionality reduction for work (PCA) and pleasure (t-SNE). set.seed(1000) library(scater) sce <- runPCA(sce, ncomponents=20, subset_row=hvgs) sce <- runUMAP(sce, dimred="PCA") mat <- reducedDim(sce, "PCA") dim(mat)
Our first algorithm is good old hierarchical clustering, as implemented using hclust()
from the stats package.
This automatically sets the cut height to half the dendrogram height.
library(bluster) hclust.out <- clusterRows(mat, HclustParam()) plotUMAP(sce, colour_by=I(hclust.out))
Advanced users can achieve greater control of the procedure by passing more parameters to the HclustParam()
constructor.
Here, we use Ward's criterion for the agglomeration with a dynamic tree cut from the r CRANpkg("dynamicTreeCut")
package.
hp2 <- HclustParam(method="ward.D2", cut.dynamic=TRUE) hp2 hclust.out <- clusterRows(mat, hp2) plotUMAP(sce, colour_by=I(hclust.out))
Our next algorithm is $k$-means clustering, as implemented using the kmeans()
function.
This requires us to pass in the number of clusters, either as a number:
set.seed(100) kmeans.out <- clusterRows(mat, KmeansParam(10)) plotUMAP(sce, colour_by=I(kmeans.out))
Or as a function of the number of observations, which is useful for vector quantization purposes:
kp <- KmeansParam(sqrt) kp set.seed(100) kmeans.out <- clusterRows(mat, kp) plotUMAP(sce, colour_by=I(kmeans.out))
We can build shared or direct nearest neighbor graphs and perform community detection with r CRANpkg("igraph")
.
Here, the number of neighbors k
controls the resolution of the clusters.
set.seed(101) # just in case there are ties. graph.out <- clusterRows(mat, NNGraphParam(k=10)) plotUMAP(sce, colour_by=I(graph.out))
It is again straightforward to tune the procedure by passing more arguments such as the community detection algorithm to use.
set.seed(101) # just in case there are ties. np <- NNGraphParam(k=20, cluster.fun="louvain") np graph.out <- clusterRows(mat, np) plotUMAP(sce, colour_by=I(graph.out))
We also provide a wrapper for a hybrid "two-step" approach for handling larger datasets. Here, a fast agglomeration is performed with $k$-means to compact the data, followed by a slower graph-based clustering step to obtain interpretable meta-clusters. (This dataset is not, by and large, big enough for this approach to work particularly well.)
set.seed(100) # for the k-means two.out <- clusterRows(mat, TwoStepParam()) plotUMAP(sce, colour_by=I(two.out))
Each step is itself parametrized by BlusterParam
objects, so it is possible to tune them individually:
twop <- TwoStepParam(second=NNGraphParam(k=5)) twop set.seed(100) # for the k-means two.out <- clusterRows(mat, TwoStepParam()) plotUMAP(sce, colour_by=I(two.out))
Sometimes the vector of cluster assignments is not enough.
We can obtain more information about the clustering procedure by setting full=TRUE
in clusterRows()
.
For example, we could obtain the actual graph generated by NNGraphParam()
:
nn.out <- clusterRows(mat, NNGraphParam(), full=TRUE) nn.out$objects$graph
The assignment vector is still reported in the clusters
entry:
table(nn.out$clusters)
clusterRows()
enables users or developers to easily switch between clustering algorithms by changing a single argument.
Indeed, by passing the BlusterParam
object across functions, we can ensure that the same algorithm is used through a workflow.
It is also helpful for package functions as it provides diverse functionality without compromising a clean function signature.
However, the true power of this framework lies in its extensibility.
Anyone can write a clusterRows()
method for a new clustering algorithm with an associated BlusterParam
subclass,
and that procedure is immediately compatible with any workflow or function that was already using clusterRows()
.
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.