TwoStepParam-class: Two step clustering with vector quantization

TwoStepParam-class    R Documentation

Two step clustering with vector quantization

Description

For large datasets, we can perform vector quantization (e.g., with k-means clustering) to create centroids. These centroids are then subjected to a slower clustering technique such as graph-based community detection. Each observation is then given the label of the centroid to which it was assigned.

Usage

TwoStepParam(first = KmeansParam(centers = sqrt), second = NNGraphParam())

## S4 method for signature 'ANY,TwoStepParam'
clusterRows(x, BLUSPARAM, full = FALSE)

Arguments

first

A BlusterParam object specifying a fast vector quantization technique.

second

A BlusterParam object specifying the second clustering technique on the centroids.

x

A numeric matrix-like object where rows represent observations and columns represent variables.

BLUSPARAM

A TwoStepParam object.

full

Logical scalar indicating whether the clustering statistics from both steps should be returned.
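
As an illustration of the arguments above, the following sketch builds a TwoStepParam with non-default settings for both steps. The data, the choice of 50 centers and the choice of k = 5 neighbors are arbitrary values picked for this example, not recommendations.

```r
library(bluster)

set.seed(1)
# Simulated data: 2000 observations (rows) by 10 variables (columns).
m <- matrix(rnorm(20000), ncol = 10)

# First step: k-means with a fixed number of centers (50 is arbitrary).
# Second step: graph-based clustering of the 50 centroids, using a
# smaller neighborhood than the default (k = 5 is arbitrary).
param <- TwoStepParam(
    first = KmeansParam(centers = 50),
    second = NNGraphParam(k = 5)
)

# One label per row of 'm', inherited from the assigned centroid.
clusters <- clusterRows(m, param)
table(clusters)
```

The number of meta-clusters reported by table() depends on the simulated data and on the community detection algorithm, so no particular output is shown here.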

Details

Here, the idea is to use a fast clustering algorithm to perform vector quantization and reduce the size of the dataset, followed by a slower algorithm that aggregates the centroids for easier interpretation. The exact number of clusters in the first step matters little, provided that there are few enough centroids for the second step to remain fast but enough of them that the clusters are still sufficiently granular. The second step can take more care (and computational time) in summarizing the centroids into meaningful “meta-clusters”.

The default choice is to use k-means for the first step, with the number of clusters set to the square root of the number of observations, and graph-based clustering for the second step, which automatically detects a suitable number of clusters. K-means also eliminates density differences in the data that can introduce variable resolution from graph-based methods.
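
The two-step idea can be sketched in base R without bluster. This is only a minimal illustration under stated assumptions: hierarchical clustering (hclust) stands in for graph-based community detection, the cut into 4 meta-clusters is arbitrary, and rounding the square root up with ceiling() is an assumption of this sketch. It is not the code that clusterRows runs internally.

```r
set.seed(100)
# Simulated data: 1000 observations by 5 variables.
x <- matrix(rnorm(5000), ncol = 5)

# Step 1: vector quantization with k-means, using roughly
# sqrt(number of observations) centers.
k <- ceiling(sqrt(nrow(x)))
km <- kmeans(x, centers = k, iter.max = 50)

# Step 2: a slower method applied to the centroids only.
# hclust is a stand-in here for graph-based clustering.
hc <- hclust(dist(km$centers), method = "ward.D2")
meta <- unname(cutree(hc, k = 4))

# Each observation inherits the meta-cluster of its centroid.
labels <- factor(meta[km$cluster])
table(labels)
```

The second step only sees k centroids rather than all observations, which is where the time saving comes from.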

To inspect or modify an existing TwoStepParam object x, users can simply call x[[i]] or x[[i]] <- value, where i is the name of any argument used in the constructor.
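
A short sketch of the accessor and replacement syntax described above. The variable name param and the value k = 5 are chosen only for this example.

```r
library(bluster)

param <- TwoStepParam()

# Retrieve the parameter object for the first step.
param[["first"]]

# Replace the second step with a graph-based clustering that uses
# fewer neighbors (k = 5 is an arbitrary value for illustration).
param[["second"]] <- NNGraphParam(k = 5)

# The nested parameter objects support the same accessor.
param[["second"]][["k"]]
```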

Value

The TwoStepParam constructor will return a TwoStepParam object with the specified parameters.

The clusterRows method will return a factor of length equal to nrow(x) containing the cluster assignments. If full=TRUE, a list is returned with a clusters factor and an objects list containing:

  • first, a list of objects from the first clustering step. This is equal to the objects list in the output of clusterRows with the first BlusterParam.

  • centroids, a numeric matrix of centroids generated from the first clustering step.

  • second, a list of objects from the second clustering step on the centroids. This is equal to the objects list in the output of clusterRows with the second BlusterParam.
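
The structure of the full=TRUE output listed above can be inspected as follows. This is a sketch on simulated data; the variable names are chosen for this example.

```r
library(bluster)

set.seed(42)
# Simulated data: 1000 observations by 10 variables.
m <- matrix(runif(10000), ncol = 10)

out <- clusterRows(m, TwoStepParam(), full = TRUE)

# Cluster assignments, one per row of 'm'.
table(out$clusters)

# Components retained from the two steps.
names(out$objects)

# One row per centroid from the first step, one column per variable.
dim(out$objects$centroids)
```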

Author(s)

Aaron Lun

Examples

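library(bluster)
# 10000 observations (rows) by 10 variables (columns).
m <- matrix(runif(100000), ncol=10)
stuff <- clusterRows(m, TwoStepParam())
# Number of observations assigned to each cluster.
table(stuff)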


LTLA/bluster documentation built on Sept. 8, 2024, 4:37 a.m.