clusterMany | R Documentation |
Given a range of parameters, this function will return a matrix with the
clustering of the samples across the range, which can be passed to
plotClusters for visualization.
## S4 method for signature 'matrixOrHDF5'
clusterMany(
x,
reduceMethod = "none",
nReducedDims = NA,
transFun = NULL,
isCount = FALSE,
...
)
## S4 method for signature 'SingleCellExperiment'
clusterMany(
x,
ks = NA,
clusterFunction,
reduceMethod = "none",
nFilterDims = defaultNDims(x, reduceMethod, type = "filterStats"),
nReducedDims = defaultNDims(x, reduceMethod, type = "reducedDims"),
alphas = 0.1,
findBestK = FALSE,
sequential = FALSE,
removeSil = FALSE,
subsample = FALSE,
silCutoff = 0,
distFunction = NA,
betas = 0.9,
minSizes = 1,
transFun = NULL,
isCount = FALSE,
verbose = TRUE,
parameterWarnings = FALSE,
mainClusterArgs = NULL,
subsampleArgs = NULL,
seqArgs = NULL,
whichAssay = 1,
makeMissingDiss = if (ncol(x) < 1000) TRUE else FALSE,
ncores = 1,
random.seed = NULL,
run = TRUE,
...
)
## S4 method for signature 'ClusterExperiment'
clusterMany(
x,
reduceMethod = "none",
nFilterDims = defaultNDims(x, reduceMethod, type = "filterStats"),
nReducedDims = defaultNDims(x, reduceMethod, type = "reducedDims"),
eraseOld = FALSE,
...
)
## S4 method for signature 'SummarizedExperiment'
clusterMany(x, ...)
## S4 method for signature 'data.frame'
clusterMany(x, ...)
x |
the data matrix on which to run the clustering. Can be object of the
following classes: matrix (with genes in rows),
|
reduceMethod |
character. A character identifying what type of dimensionality reduction to perform before clustering. Options are 1) "none", 2) one of listBuiltInReducedDims() or listBuiltInFilterStats(), OR 3) stored filtering or reducedDim values in the object. |
nReducedDims |
vector of the number of dimensions to use (when
|
transFun |
a transformation function to be applied to the data. If the
transformation applied to the data creates an error or NA values, then the
function will throw an error. If object is of class
|
isCount |
if |
... |
For signature |
ks |
the range of k values (see details for the meaning of |
clusterFunction |
function used for the clustering. This must be either
1) a character vector of built-in clustering techniques, or 2) a
named list of |
nFilterDims |
vector of the number of the most variable features to keep
(when "var", "abscv", or "mad" is identified in |
alphas |
values of alpha to be tried. Only used for clusterFunctions of
type '01'. Determines tightness required in creating clusters from the
dissimilarity matrix. Takes on values in [0,1]. See documentation of
|
findBestK |
logical, whether to find the best K based on average silhouette width (only used when clusterFunction is of type "K"). |
sequential |
logical whether to use the sequential strategy (see details
of |
removeSil |
logical as to whether to remove samples when silhouette < silCutoff (only used if clusterFunction is of type "K") |
subsample |
logical as to whether to subsample via
|
silCutoff |
Requirement on minimum silhouette width to be included in cluster (only for combinations where removeSil=TRUE). |
distFunction |
a vector of character strings that are the names of
distance functions found in the global environment. See the help pages of
|
betas |
values of |
minSizes |
the minimum size required for a cluster (in the
|
verbose |
logical. If TRUE it will print informative messages. |
parameterWarnings |
logical, as to whether warnings and comments from checking the validity of the parameter combinations should be printed. |
mainClusterArgs |
list of arguments to be passed for the mainClustering
step, see help pages of |
subsampleArgs |
list of arguments to be passed to the subsampling step
(if |
seqArgs |
list of arguments to be passed to |
whichAssay |
numeric or character specifying which assay to use. See
|
makeMissingDiss |
logical. Whether to calculate necessary distance
matrices needed when input is not "diss". If TRUE, then when a clustering
function calls for an inputType "diss", but the given matrix is of type "X",
the function will calculate a distance matrix. A dissimilarity matrix
will also be calculated if a post-processing argument like |
ncores |
the number of threads |
random.seed |
a value to set seed before each run of clusterSingle (so that all of the runs are run on the same subsample of the data). Note, if 'random.seed' is set, argument 'ncores' should NOT be passed via subsampleArgs; instead set the argument 'ncores' of clusterMany directly (which is preferred for improving speed anyway). |
run |
logical. If FALSE, doesn't run clustering, but just returns matrix
of parameters that will be run, for the purpose of inspection by user (with
rownames equal to the names of the resulting column names of clMat object
that would be returned if |
eraseOld |
logical. Only relevant if input |
Some combinations of these parameters are not feasible. See the
documentation of clusterSingle for important information on how these
parameter choices interact.
While the function allows for multiple values of clusterFunction,
the code does not reuse the same subsampling matrix and try different
clusterFunctions on it. This is because if sequential=TRUE, different
subsample clusterFunctions will create different sets of data to subsample,
so it is not possible; if sequential=FALSE, we have not implemented
functionality for this reuse. Setting the random.seed value, however,
should mean that the subsampled matrix is the same for each, but
there is no gain in computational complexity (i.e. each subsampled
co-occurrence matrix is recalculated for each set of parameters).
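The effect of resetting the seed before each run can be illustrated in plain base R. The sketch below is not the package's code: drawSubsample is a hypothetical helper, and the sample counts are made up. It only shows why two runs that start from the same seed draw the same subsample, even though each run repeats the work.

```r
# Hypothetical helper (not part of clusterExperiment): reset the seed,
# then draw a subsample of sample indices.
drawSubsample <- function(nSamples, nKeep, seed) {
  set.seed(seed)
  sample(nSamples, size = nKeep)
}

# Two separate "runs" with the same seed; the draw is recomputed each time.
runA <- drawSubsample(nSamples = 100, nKeep = 70, seed = 48120)
runB <- drawSubsample(nSamples = 100, nKeep = 70, seed = 48120)

# Both runs end up with exactly the same indices.
sameSubsample <- identical(runA, runB)
print(sameSubsample)
```

The repeated call mirrors the point made above: the subsample is reproduced, not reused, so nothing is saved computationally.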
The argument ks is interpreted differently for different choices of the
other parameters. When sequential=TRUE, ks defines the argument k0 of
seqCluster. Otherwise, the ks values are the k values for both the
mainClustering and subsampling steps (i.e. they are assigned to the
subsampleArgs and mainClusterArgs that are passed to mainClustering and
subsampleClustering), unless k is set appropriately in subsampleArgs.
Passing these arguments via subsampleArgs only has an effect if
subsample=TRUE. Similarly, passing mainClusterArgs[["k"]] only has an
effect when the clusterFunction argument includes a clustering algorithm of
type "K". When findBestK=TRUE, ks also defines the kRange argument of
mainClustering unless kRange is specified by the user via mainClusterArgs;
note this means that the default option of setting a kRange that depends on
the input k (see mainClustering) is not available in clusterMany, only in
clusterSingle.
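The routing rules for ks can be restated as a small base R function. This is only a reading aid under stated assumptions: routeK is a hypothetical helper, not a function in the package, and the output field names (k0, mainK, subsampleK, kRange) are labels chosen here. The text does not say how sequential=TRUE and findBestK=TRUE interact, so the sketch treats them independently.

```r
# Hypothetical helper (not part of clusterExperiment): shows where one value
# k taken from ks would end up, following the rules described in the text.
routeK <- function(k, ks, sequential = FALSE, findBestK = FALSE,
                   subsample = FALSE,
                   mainClusterArgs = list(), subsampleArgs = list()) {
  out <- list()
  if (sequential) {
    # ks defines the k0 argument of the sequential step
    out$k0 <- k
  } else {
    # ks gives k for the main clustering step ...
    out$mainK <- k
    if (subsample) {
      # ... and for the subsampling step, unless the user set k there.
      # [[ ]] is used for exact name matching.
      userK <- subsampleArgs[["k"]]
      out$subsampleK <- if (is.null(userK)) k else userK
    }
  }
  if (findBestK) {
    # ks also defines kRange unless the user supplied one
    userRange <- mainClusterArgs[["kRange"]]
    out$kRange <- if (is.null(userRange)) ks else userRange
  }
  out
}

rSeq  <- routeK(3, ks = 2:4, sequential = TRUE)
rSub  <- routeK(3, ks = 2:4, subsample = TRUE, subsampleArgs = list(k = 5))
rBest <- routeK(3, ks = 2:4, findBestK = TRUE)
rUser <- routeK(3, ks = 2:4, findBestK = TRUE,
                mainClusterArgs = list(kRange = 2:15))
str(rUser)
```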
If the input is a ClusterExperiment object, the current implementation is
that existing orderSamples, coClustering or the many dendrogram slots will
be retained.
If run=FALSE, the function will still calculate reduced dimensions or
filter statistics if they are not already calculated and saved in the
object. Moreover, the results of these calculations will not be saved.
Therefore, if these steps are lengthy for large datasets, it is recommended
to do them before calling the function.
The given reduceMethod values must either all be precalculated
filtering/dimensionality reductions stored in the appropriate location, or
must all be character values giving built-in filtering/dimensionality
reduction methods to be calculated. If some of the filtering/dimensionality
methods are already calculated and stored, but not all, then they will all
be recalculated (and if they are not all built-in methods, this will give
an error). So to save computational time with pre-calculated dimensionality
reduction, the user must make sure they are all precalculated. Also,
user-defined values (i.e. not built-in functions) cannot be mixed with
built-in functions unless they have already been precalculated (see
makeFilterStats or makeReducedDims).
If run=TRUE, the function will return a ClusterExperiment object, where the
results are stored as clusterings with clusterTypes clusterMany. Depending
on the eraseOld argument above, this will either delete existing such
objects, or change the clusterTypes of existing objects. Arbitrarily, the
first clustering is set as the primaryClusteringIndex.
If run=FALSE, a list with elements:
paramMatrix
a matrix giving the parameters of each clustering, where each column is a
possible parameter set by the user and passed to clusterSingle, and each
row of paramMatrix corresponds to a clustering in clMat
mainClusterArgs
a list of (possibly modified) arguments to mainClusterArgs
seqArgs
a list of (possibly modified) arguments to seqArgs
subsampleArgs
a list of (possibly modified) arguments to subsampleArgs
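As a rough illustration of "one row per clustering", the full cross of a set of requested ranges can be enumerated in base R. This is only a sketch using expand.grid with values like those in the example call; the column names are chosen here to mirror the argument names, and the paramMatrix actually returned by clusterMany may differ (for example, it may drop or merge combinations that are not feasible).

```r
# Illustrative only: plain base R, not the package's internal code.
# Cross every requested value of each parameter with every other.
paramGrid <- expand.grid(
  nReducedDims = c(5, 10, 50),
  k = 2:4,
  findBestK = c(TRUE, FALSE),
  removeSil = c(TRUE, FALSE)
)

# 3 * 3 * 2 * 2 = 36 candidate parameter sets
nCombinations <- nrow(paramGrid)
print(nCombinations)
head(paramGrid)
```

Even a few short ranges multiply quickly, which is why inspecting the planned runs with run=FALSE before launching them can be worthwhile.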
## Not run:
library(clusterExperiment)
data(simData)
#Example: clustering using pam with different dimensions of pca and different
#k and whether remove negative silhouette values
#check how many and what runs user choices will imply:
checkParams <- clusterMany(simData,reduceMethod="PCA", makeMissingDiss=TRUE,
nReducedDims=c(5,10,50), clusterFunction="pam", isCount=FALSE,
ks=2:4,findBestK=c(TRUE,FALSE),removeSil=c(TRUE,FALSE),run=FALSE)
print(head(checkParams$paramMatrix))
#Now actually run it
cl <- clusterMany(simData,reduceMethod="PCA", nReducedDims=c(5,10,50), isCount=FALSE,
clusterFunction="pam",ks=2:4,findBestK=c(TRUE,FALSE),makeMissingDiss=TRUE,
removeSil=c(TRUE,FALSE))
print(cl)
head(colnames(clusterMatrix(cl)))
#make names shorter for plotting
clNames <- clusterLabels(cl)
clNames <- gsub("TRUE", "T", clNames)
clNames <- gsub("FALSE", "F", clNames)
clNames <- gsub("k=NA,", "", clNames)
par(mar=c(2, 10, 1, 1))
plotClusters(cl, axisLine=-2,clusterLabels=clNames)
#following code takes around 1+ minutes to run because of the subsampling
#that is redone each time:
system.time(clusterTrack <- clusterMany(simData, ks=2:15,
alphas=c(0.1,0.2,0.3), findBestK=c(TRUE,FALSE), sequential=c(FALSE),
subsample=c(FALSE), removeSil=c(TRUE), clusterFunction="pam",
makeMissingDiss=TRUE,
mainClusterArgs=list(minSize=5, kRange=2:15), ncores=1, random.seed=48120))
## End(Not run)