clusterSingle | R Documentation |
Given input data, this function will find clusters, based on a single specification of parameters.
## S4 method for signature 'SummarizedExperiment'
clusterSingle(inputMatrix, ...)
## S4 method for signature 'ClusterExperiment'
clusterSingle(inputMatrix, ...)
## S4 method for signature 'SingleCellExperiment'
clusterSingle(
inputMatrix,
reduceMethod = "none",
nDims = defaultNDims(inputMatrix, reduceMethod),
whichAssay = 1,
...
)
## S4 method for signature 'matrixOrHDF5OrNULL'
clusterSingle(
inputMatrix,
inputType = "X",
subsample = FALSE,
sequential = FALSE,
distFunction = NA,
mainClusterArgs = NULL,
subsampleArgs = NULL,
seqArgs = NULL,
isCount = FALSE,
transFun = NULL,
reduceMethod = "none",
nDims = defaultNDims(inputMatrix, reduceMethod),
makeMissingDiss = if (ncol(inputMatrix) < 1000) TRUE else FALSE,
clusterLabel = "clusterSingle",
saveSubsamplingMatrix = FALSE,
checkDiss = FALSE,
warnings = TRUE
)
inputMatrix |
numerical matrix on which to run the clustering or a
|
... |
arguments to be passed on to the method for signature
|
reduceMethod |
character. A character string identifying what type of dimensionality reduction to perform before clustering. Options are 1) "none", 2) one of listBuiltInReducedDims() or listBuiltInFilterStats(), or 3) stored filtering or reducedDim values in the object. |
nDims |
integer. An integer identifying how many dimensions to reduce to
in the reduction specified by |
whichAssay |
numeric or character specifying which assay to use. See
|
inputType |
a character vector defining what type of input is given in
the |
subsample |
logical as to whether to subsample via
|
sequential |
logical whether to use the sequential strategy (see details
of |
distFunction |
a distance function to be applied to |
mainClusterArgs |
list of arguments to be passed for the mainClustering
step, see help pages of |
subsampleArgs |
list of arguments to be passed to the subsampling step
(if |
seqArgs |
list of arguments to be passed to |
isCount |
if |
transFun |
a transformation function to be applied to the data. If the
transformation applied to the data creates an error or NA values, then the
function will throw an error. If object is of class
|
makeMissingDiss |
logical. Whether to calculate the distance
matrices needed when the input is not "diss". If TRUE, then when a clustering
function calls for an inputType "diss", but the given matrix is of type "X",
the function will calculate a distance matrix. A dissimilarity matrix
will also be calculated if a post-processing argument like |
clusterLabel |
a string used to describe the clustering. By default it
is equal to "clusterSingle", to indicate that this clustering is the result
of a call to |
saveSubsamplingMatrix |
logical. If TRUE, the co-clustering matrix
resulting from subsampling is returned in the coClustering slot (and
replaces any existing coClustering object in the slot |
checkDiss |
logical. Whether to check that the dissimilarity
matrices are valid (whether given by the user or calculated because
|
warnings |
logical. Whether to print out the many possible warnings and messages regarding checking the internal consistency of the parameters. |
clusterSingle is an 'expert-oriented' function, intended to be used when a user wants to run a single clustering and/or have a great deal of control over the clustering parameters. Most users will find clusterMany more relevant. However, clusterMany makes certain assumptions about the intention of certain combinations of parameters that might not match the user's intent; similarly, clusterMany does not directly take a dissimilarity matrix but only a matrix of values x (though a user can define a distance function to be applied to x in clusterMany).
Unlike clusterMany, most of the relevant arguments for the actual clustering algorithms in clusterSingle are passed to the relevant steps via the arguments mainClusterArgs, subsampleArgs, and seqArgs. These arguments should be named lists with parameters that match the corresponding functions: mainClustering, subsampleClustering, and seqCluster. These three functions are not meant to be called by the user, but rather accessed via calls to clusterSingle. However, the user can consult the help files of those functions for more information regarding the parameters that they take.
Only certain combinations of parameters are possible for certain choices of sequential and subsample. These restrictions are documented below.
clusterFunction for mainClusterArgs: The choice of subsample=TRUE also controls what algorithm type of clustering functions can be used in the mainClustering step. When subsample=TRUE, the resulting co-clustering matrix from subsampling is converted to a dissimilarity (specifically 1-coclustering values) and is passed to the diss argument of mainClustering. For this reason, the ClusterFunction object given to mainClustering via the argument mainClusterArgs must take input in the form of a dissimilarity. When subsample=FALSE and sequential=TRUE, the clusterFunction passed in the clusterArgs element of mainClusterArgs must define a ClusterFunction object with algorithmType 'K'. When subsample=FALSE and sequential=FALSE, there are no restrictions on the ClusterFunction, and that clustering is applied directly to the input data.
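The conversion from a co-clustering matrix to a dissimilarity can be illustrated with a small base R sketch. This is not the code used by subsampleClustering; it is a simplified stand-in that uses kmeans on toy data, and the names B, sampP, together, and sampled are illustrative assumptions. Note that this toy matrix stores samples in rows.

```r
# Toy data: 20 samples in rows, 2 features, forming two separated groups.
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 6), ncol = 2))
n <- nrow(x)

B <- 50        # number of subsamples (illustrative value)
sampP <- 0.7   # proportion of samples drawn each time (illustrative value)
together <- matrix(0, n, n)  # times a pair was placed in the same cluster
sampled  <- matrix(0, n, n)  # times a pair was drawn in the same subsample

for (b in seq_len(B)) {
  idx <- sample(n, size = floor(sampP * n))
  cl <- kmeans(x[idx, , drop = FALSE], centers = 2, nstart = 5)$cluster
  sampled[idx, idx] <- sampled[idx, idx] + 1
  together[idx, idx] <- together[idx, idx] + outer(cl, cl, "==")
}

# Proportion of co-clustering among the subsamples that contained both samples.
coClust <- together / sampled
# Dissimilarity as described above: 1 - co-clustering values.
diss <- 1 - coClust
```

A matrix such as diss is symmetric with a zero diagonal, which is why the clustering function used in the mainClustering step must accept a dissimilarity rather than the original data.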
clusterFunction for subsampleArgs: If the ClusterFunction object given to the clusterArgs of subsampleArgs is missing, the algorithm will use the default for subsampleClustering (currently "pam"). If sequential=TRUE, this ClusterFunction object must be of type 'K'.
Setting k for subsampling: If subsample=TRUE and sequential=TRUE, the current K of the sequential iteration determines the 'k' argument passed to subsampleClustering, so setting 'k=' in the list given to subsampleArgs will have no effect and will produce a warning to that effect (see documentation of seqCluster).
Setting k for mainClustering step: If sequential=TRUE, then the user should not set k in the clusterArgs argument of mainClusterArgs because it must be set by the sequential code, which iteratively resets the parameters. Specifically, if subsample=FALSE, then the sequential method iterates over choices of k to cluster the input data. If subsample=TRUE, then the clustering in the mainClustering step (assuming the clustering function is of type 'K') will use the k used in the subsampling step, to make sure that the k used in the mainClustering step is reasonable.
Setting findBestK in mainClusterArgs: If sequential=TRUE and subsample=FALSE, the user should not set 'findBestK=TRUE' in mainClusterArgs. This is because in this case the sequential method changes k; an error message will be given if this combination of options is set. However, if sequential=TRUE and subsample=TRUE, then passing either 'findBestK=TRUE' or 'findBestK=FALSE' via mainClusterArgs will function as expected (assuming the clusterFunction argument passed to mainClusterArgs is of type 'K'). In particular, the sequential step will set the number of clusters k for clustering of each subsample. If findBestK=FALSE, that same k will be used for the mainClustering step that clusters the resulting co-occurrence matrix after subsampling. If findBestK=TRUE, then mainClustering will search for the best k. Note that the default 'kRange' over which mainClustering searches when findBestK=TRUE depends on the input value of k (which is set by the sequential method if sequential=TRUE), see above. The user can change kRange to not depend on k and to be fixed across all of the sequential steps by setting kRange explicitly in the mainClusterArgs list.
To provide a distance matrix via the argument distFunction, the function must be defined to take the distance of the rows of a matrix (internally, the function will call distFunction(t(x))). This is to be compatible with the input for the dist function. as.matrix will be performed on the output of distFunction, so if the object returned has an as.matrix method that will convert the output into a symmetric matrix of distances, this is fine (for example, the class dist for objects returned by dist has such a method). If distFunction=NA, then a default distance will be calculated based on the type of clustering algorithm of clusterFunction. For type "K" the default is to take dist as the distance function. For type "01", the default is to take (1-cor(x))/2.
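The row-wise convention for distFunction can be illustrated with base R alone. The sketch below is an illustration under assumptions, not the package's internal code: manhattanDist and corDist are hypothetical names, and the transposed call only mimics the distFunction(t(x)) call described above.

```r
# Toy matrix with features in rows and samples in columns.
set.seed(2)
x <- matrix(rnorm(5 * 8), nrow = 5, ncol = 8)

# Hypothetical distance function: Manhattan distance between the ROWS of its
# argument, returning an object of class "dist".
manhattanDist <- function(m) dist(m, method = "manhattan")

# Mimic the internal call described above: the data are transposed first,
# so the distances are computed between samples.
d <- as.matrix(manhattanDist(t(x)))
dim(d)  # one row and one column per sample

# A correlation-based distance written in the same row-wise convention
# (an illustrative variant of the (1 - cor) / 2 form, not the package default).
corDist <- function(m) (1 - cor(t(m))) / 2
d01 <- corDist(t(x))
```

A function of this form, taking a matrix and returning distances between its rows, is the kind of object that could be supplied through the distFunction argument.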
A ClusterExperiment object if inputType is of type "X".
If input was not of type "X", then the result is a list with values:
clustering: The vector of clustering results.
clusterInfo: A list with information about the parameters run in the clustering.
coClusterMatrix: (only if saveSubsamplingMatrix=TRUE) NxB set of clusterings obtained after B subsamples.
clusterMany to compare multiple choices of parameters, and mainClustering, subsampleClustering, and seqCluster for the underlying functions called by clusterSingle.
data(simData)

## Not run:
# The following code takes some time.
# Use clusterSingle to do sequential clustering
# (same as example in seqCluster only using clusterSingle ...)
set.seed(44261)
clustSeqHier_v2 <- clusterSingle(simData,
    sequential=TRUE, subsample=TRUE,
    subsampleArgs=list(resamp.n=100, samp.p=0.7,
        clusterFunction="kmeans", clusterArgs=list(nstart=10)),
    seqArgs=list(beta=0.8, k0=5),
    mainClusterArgs=list(minSize=5,
        clusterFunction="hierarchical01", clusterArgs=list(alpha=0.1)))
## End(Not run)

# Use clusterSingle to do just clustering with k=3 and no subsampling
clustObject <- clusterSingle(simData,
    subsample=FALSE, sequential=FALSE,
    mainClusterArgs=list(clusterFunction="pam", clusterArgs=list(k=3)))
# Compare to standard pam
pamOut <- cluster::pam(t(simData), k=3, cluster.only=TRUE)
all(pamOut == primaryCluster(clustObject))