clusterSingle: General wrapper method to cluster the data

clusterSingle R Documentation

General wrapper method to cluster the data

Description

Given input data, this function will find clusters, based on a single specification of parameters.

Usage

## S4 method for signature 'SummarizedExperiment'
clusterSingle(inputMatrix, ...)

## S4 method for signature 'ClusterExperiment'
clusterSingle(inputMatrix, ...)

## S4 method for signature 'SingleCellExperiment'
clusterSingle(
  inputMatrix,
  reduceMethod = "none",
  nDims = defaultNDims(inputMatrix, reduceMethod),
  whichAssay = 1,
  ...
)

## S4 method for signature 'matrixOrHDF5OrNULL'
clusterSingle(
  inputMatrix,
  inputType = "X",
  subsample = FALSE,
  sequential = FALSE,
  distFunction = NA,
  mainClusterArgs = NULL,
  subsampleArgs = NULL,
  seqArgs = NULL,
  isCount = FALSE,
  transFun = NULL,
  reduceMethod = "none",
  nDims = defaultNDims(inputMatrix, reduceMethod),
  makeMissingDiss = if (ncol(inputMatrix) < 1000) TRUE else FALSE,
  clusterLabel = "clusterSingle",
  saveSubsamplingMatrix = FALSE,
  checkDiss = FALSE,
  warnings = TRUE
)

Arguments

inputMatrix

numerical matrix on which to run the clustering or a SummarizedExperiment, SingleCellExperiment, or ClusterExperiment object.

...

arguments to be passed on to the method for signature matrixOrHDF5OrNULL.

reduceMethod

character. A character string identifying what type of dimensionality reduction to perform before clustering. Options are 1) "none", 2) one of listBuiltInReducedDims() or listBuiltInFilterStats(), or 3) stored filtering or reducedDim values in the object.

nDims

integer. An integer identifying how many dimensions to reduce to in the reduction specified by reduceMethod. Defaults to the output of defaultNDims.

whichAssay

numeric or character specifying which assay to use. See assay for details.

inputType

a character vector defining what type of input is given in the inputMatrix argument. Must consist of values "diss", "X", or "cat" (see details). "X" and "cat" indicate matrices with features in the rows and samples in the columns; "cat" corresponds to the features being integers that code for categories, while "X" corresponds to continuous-valued features. "diss" corresponds to an inputMatrix that is an NxN dissimilarity matrix. "cat" is largely used internally for clustering of sets of clusterings.
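As an illustration of the three input layouts described above, the following base R sketch builds one toy matrix of each kind. The object names (X, diss, catMat) and the data are hypothetical and are not part of the package.

```r
# Hypothetical toy data: 4 features (rows) x 6 samples (columns)
set.seed(10)
X <- matrix(rnorm(24), nrow = 4, ncol = 6)

# "diss": an N x N dissimilarity between the N = ncol(X) samples.
# dist() works on rows, so the matrix is transposed first.
diss <- as.matrix(dist(t(X)))

# "cat": each row is a categorical feature coded as integers, for example
# the labels from two earlier clusterings of the same 6 samples
catMat <- rbind(c(1L, 1L, 2L, 2L, 3L, 3L),
                c(1L, 2L, 2L, 1L, 1L, 2L))

dim(X)       # features x samples
dim(diss)    # samples x samples
dim(catMat)  # clusterings x samples
```

In all three layouts the samples are indexed by the columns, so ncol() of each matrix is the same.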

subsample

logical as to whether to subsample via subsampleClustering. If TRUE, clustering in the mainClustering step is done on the co-occurrence between clusterings in the subsampled clustering results. If FALSE, the mainClustering step will be run directly on x/diss.

sequential

logical whether to use the sequential strategy (see details of seqCluster). Can be used in combination with subsample=TRUE or FALSE.

distFunction

a distance function to be applied to inputMatrix. Only relevant if inputType="X". See details of clusterSingle for the required format of the distance function.

mainClusterArgs

list of arguments to be passed for the mainClustering step, see help pages of mainClustering.

subsampleArgs

list of arguments to be passed to the subsampling step (if subsample=TRUE), see help pages of subsampleClustering.

seqArgs

list of arguments to be passed to seqCluster.

isCount

if transFun=NULL, then isCount=TRUE will determine the transformation as defined by function(x){log2(x+1)}, and isCount=FALSE will give the transformation function function(x){x}. Ignored if transFun is not NULL. If object is of class ClusterExperiment, the stored transformation will be used and giving this parameter will result in an error.

transFun

a transformation function to be applied to the data. If the transformation applied to the data creates an error or NA values, then the function will throw an error. If object is of class ClusterExperiment, the stored transformation will be used and giving this parameter will result in an error.
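The two transformations named under isCount can be written out in base R as below. This is only a sketch of what those two settings correspond to, with hypothetical names (counts, countTransform, identityTransform) and a made-up count matrix; it also shows a quick check that a candidate transFun does not produce NA values.

```r
# Hypothetical count matrix: 3 features (rows) x 4 samples (columns)
counts <- matrix(c(0, 1, 3, 7, 15, 31, 63, 127, 255, 511, 1023, 2047),
                 nrow = 3)

# Transformation corresponding to isCount=TRUE, as given in the description
countTransform <- function(x) { log2(x + 1) }
# Transformation corresponding to isCount=FALSE (identity)
identityTransform <- function(x) { x }

logCounts <- countTransform(counts)

# A transformation must not introduce NA (or NaN) values;
# anyNA() is a simple way to check before passing a function as transFun
transformIsSafe <- !anyNA(logCounts)
```

A function such as log2 without the added 1 would give -Inf for zero counts, which is one reason to check a custom transFun on the data before using it.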

makeMissingDiss

logical. Whether to calculate the distance matrices needed when the input is not "diss". If TRUE, then when a clustering function calls for an inputType "diss", but the given matrix is of type "X", the function will calculate a distance matrix. A dissimilarity matrix will also be calculated if a post-processing argument like findBestK or removeSil is chosen, since these rely on calculating silhouette widths from distances.

clusterLabel

a string used to describe the clustering. By default it is equal to "clusterSingle", to indicate that this clustering is the result of a call to clusterSingle.

saveSubsamplingMatrix

logical. If TRUE, the co-clustering matrix resulting from subsampling is returned in the coClustering slot (replacing any existing coClustering object in that slot if the input object is a ClusterExperiment object).

checkDiss

logical. Whether to check that the dissimilarity matrices are valid (whether given by the user or calculated because makeMissingDiss=TRUE).

warnings

logical. Whether to print out the many possible warnings and messages regarding checking the internal consistency of the parameters.

Details

clusterSingle is an 'expert-oriented' function, intended to be used when a user wants to run a single clustering and/or have a great deal of control over the clustering parameters. Most users will find clusterMany more relevant. However, clusterMany makes certain assumptions about the intention of certain combinations of parameters that might not match the user's intent; similarly clusterMany does not directly take a dissimilarity matrix but only a matrix of values x (though a user can define a distance function to be applied to x in clusterMany).

Unlike clusterMany, most of the relevant arguments for the actual clustering algorithms in clusterSingle are passed to the relevant steps via the arguments mainClusterArgs, subsampleArgs, and seqArgs. These arguments should be named lists with parameters that match the corresponding functions: mainClustering, subsampleClustering, and seqCluster. These three functions are not meant to be called by the user, but rather accessed via calls to clusterSingle. The user can, however, look at the help files of those functions for more information regarding the parameters that they take.

Only certain combinations of parameters are possible for certain choices of sequential and subsample. These restrictions are documented below.

  • clusterFunction for mainClusterArgs: The choice of subsample=TRUE also controls what algorithm type of clustering functions can be used in the mainClustering step. When subsample=TRUE, the resulting co-clustering matrix from subsampling is converted to a dissimilarity (specifically 1-coclustering values) and is passed to diss of mainClustering. For this reason, the ClusterFunction object given to mainClustering via the argument mainClusterArgs must take input in the form of a dissimilarity. When subsample=FALSE and sequential=TRUE, the clusterFunction passed in the clusterArgs element of mainClusterArgs must define a ClusterFunction object with algorithmType 'K'. When subsample=FALSE and sequential=FALSE, there are no restrictions on the ClusterFunction and that clustering is applied directly to the input data.

  • clusterFunction for subsampleArgs: If the ClusterFunction object given to the clusterArgs of subsampleArgs is missing, the algorithm will use the default for subsampleClustering (currently "pam"). If sequential=TRUE, this ClusterFunction object must be of type 'K'.

  • Setting k for subsampling: If subsample=TRUE and sequential=TRUE, the current K of the sequential iteration determines the 'k' argument passed to subsampleClustering, so setting 'k=' in the list given to subsampleArgs will have no effect and will produce a warning to that effect (see documentation of seqCluster).

  • Setting k for mainClustering step: If sequential=TRUE then the user should not set k in the clusterArgs argument of mainClusterArgs because it must be set by the sequential code, which iteratively resets the parameters. Specifically, if subsample=FALSE, then the sequential method iterates over choices of k to cluster the input data. If subsample=TRUE, then the clustering in the mainClustering step (assuming the clustering function is of type 'K') will use the k used in the subsampling step, to make sure that the k used in the mainClustering step is reasonable.

  • Setting findBestK in mainClusterArgs: If sequential=TRUE and subsample=FALSE, the user should not set 'findBestK=TRUE' in mainClusterArgs. This is because in this case the sequential method changes k; an error message will be given if this combination of options is set. However, if sequential=TRUE and subsample=TRUE, then passing either 'findBestK=TRUE' or 'findBestK=FALSE' via mainClusterArgs will function as expected (assuming the clusterFunction argument passed to mainClusterArgs is of type 'K'). In particular, the sequential step will set the number of clusters k for clustering of each subsample. If findBestK=FALSE, that same k will be used for the mainClustering step that clusters the resulting co-occurrence matrix after subsampling. If findBestK=TRUE, then mainClustering will search for the best k. Note that the default 'kRange' over which mainClustering searches when findBestK=TRUE depends on the input value of k (which is set by the sequential method if sequential=TRUE, see above). The user can change kRange to not depend on k and to be fixed across all of the sequential steps by setting kRange explicitly in the mainClusterArgs list.
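The conversion mentioned in the first item above (a co-clustering matrix from subsampling turned into the dissimilarity 1-coclustering) can be sketched in base R as follows. This is an illustration of the idea only, not the package's implementation: the data, the variable names, the use of kmeans on each subsample, the choice of hclust for the final step, and the handling of pairs that are never sampled together are all assumptions of this sketch.

```r
set.seed(2024)
# Hypothetical data: 2 features (rows) x 30 samples (columns), two groups
x <- cbind(matrix(rnorm(30, mean = 0), nrow = 2),
           matrix(rnorm(30, mean = 6), nrow = 2))
n <- ncol(x)
B <- 20   # number of subsamples (illustrative value)
k <- 2    # number of clusters used on each subsample

together <- matrix(0, n, n)  # times a pair fell in the same cluster
sampled  <- matrix(0, n, n)  # times a pair was subsampled together

for (b in seq_len(B)) {
  idx <- sample(n, size = round(0.7 * n))   # subsample 70% of the samples
  cl <- kmeans(t(x[, idx]), centers = k, nstart = 5)$cluster
  sampled[idx, idx]  <- sampled[idx, idx] + 1
  together[idx, idx] <- together[idx, idx] + outer(cl, cl, "==")
}

# Proportion of joint subsamples in which each pair co-clustered
coClust <- together / sampled
# Assumption of this sketch: pairs never sampled together get co-clustering 0
coClust[sampled == 0] <- 0

# Dissimilarity that would be handed to the main clustering step
diss <- 1 - coClust

# Any clustering that accepts a dissimilarity can now be applied
labels <- cutree(hclust(as.dist(diss), method = "average"), k = k)
```

Because the final step only ever sees diss, it has to be a method that accepts a dissimilarity, which is the restriction the first item describes.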

To provide a distance function via the argument distFunction, the function must be defined to take the distance between the rows of a matrix (internally, the function will call distFunction(t(x))). This is to be compatible with the input for the dist function. as.matrix will be performed on the output of distFunction, so if the object returned has an as.matrix method that will convert the output into a symmetric matrix of distances, this is fine (for example, the class dist for objects returned by dist has such a method). If distFunction=NA, then a default distance will be calculated based on the type of clustering algorithm of clusterFunction. For type "K" the default is to take dist as the distance function. For type "01", the default is to take (1-cor(x))/2.
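A base R sketch of two functions in the format described above may help. The names manhattanDist and corDist and the toy data are hypothetical; the calls to t() and as.matrix() simply mimic the internal steps stated in the paragraph, so that the shape of the result can be inspected.

```r
# Hypothetical data: 5 features (rows) x 8 samples (columns)
set.seed(1)
x <- matrix(rnorm(40), nrow = 5, ncol = 8)

# dist()-style function: distances between the ROWS of its argument.
# It returns a "dist" object, which has an as.matrix method.
manhattanDist <- function(m) { dist(m, method = "manhattan") }

# Correlation-based function in the same row-wise convention.
# cor() works on columns, so the argument is transposed back.
corDist <- function(m) { (1 - cor(t(m))) / 2 }

# Mimic the internal call distFunction(t(x)) followed by as.matrix()
d1 <- as.matrix(manhattanDist(t(x)))
d2 <- as.matrix(corDist(t(x)))

dim(d1)  # samples x samples
dim(d2)  # samples x samples
```

Both results are symmetric with one row and one column per sample. A function that computed distances between columns instead would return a features-by-features matrix here, which is the mistake the row-wise convention is meant to rule out.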

Value

A ClusterExperiment object if inputType is of type "X".

If input was not of type "X", then the result is a list with values

  • clustering: The vector of clustering results

  • clusterInfo: A list with information about the parameters run in the clustering

  • coClusterMatrix: (only if saveSubsamplingMatrix=TRUE) NxB set of clusterings obtained after B subsamples.

See Also

clusterMany to compare multiple choices of parameters, and mainClustering, subsampleClustering, and seqCluster for the underlying functions called by clusterSingle.

Examples

data(simData)

## Not run: 
#following code takes some time.
#use clusterSingle to do sequential clustering
#(same as example in seqCluster only using clusterSingle ...)
set.seed(44261)
clustSeqHier_v2 <- clusterSingle(simData,
     sequential=TRUE, subsample=TRUE, 
     subsampleArgs=list(resamp.n=100, samp.p=0.7,
     clusterFunction="kmeans", clusterArgs=list(nstart=10)),
     seqArgs=list(beta=0.8, k0=5), mainClusterArgs=list(minSize=5,
     clusterFunction="hierarchical01",clusterArgs=list(alpha=0.1)))

## End(Not run)

#use clusterSingle to do just clustering k=3 with no subsampling
clustObject <- clusterSingle(simData,
    subsample=FALSE, sequential=FALSE,
    mainClusterArgs=list(clusterFunction="pam", clusterArgs=list(k=3)))
#compare to standard pam
pamOut<-cluster::pam(t(simData),k=3,cluster.only=TRUE)
all(pamOut==primaryCluster(clustObject))

epurdom/clusterExperiment documentation built on April 28, 2024, 8:17 p.m.