seqCluster | R Documentation |
Given a data matrix, this function calls clustering routines, sequentially removes the best cluster found, and iterates on the remaining samples to find further clusters.
seqCluster(
  inputMatrix,
  inputType,
  k0,
  subsample = TRUE,
  beta,
  top.can = 0.01,
  remain.n = 30,
  k.min = 3,
  k.max = k0 + 10,
  verbose = TRUE,
  subsampleArgs = NULL,
  mainClusterArgs = NULL,
  warnings = FALSE
)
inputMatrix |
numerical matrix on which to run the clustering or a
|
inputType |
a character vector defining what type of input is given in
the |
k0 |
the value of K at the first iteration of sequential algorithm, see details below or vignette. |
subsample |
logical as to whether to subsample via
|
beta |
value between 0 and 1 to decide how stable cluster membership has to be before 'finding' and removing the cluster. |
top.can |
only the top.can clusters from |
remain.n |
when only this number of samples are left (i.e. not yet clustered) then algorithm will stop. |
k.min |
each iteration of sequential detection of clustering will decrease the beginning K of subsampling, but not lower than k.min. |
k.max |
algorithm will stop if K in iteration is increased beyond this point. |
verbose |
whether the algorithm should print out information as to its progress. |
subsampleArgs |
list of arguments to be passed to
|
mainClusterArgs |
list of arguments to be passed to
|
warnings |
logical. Whether to print out the many possible warnings and messages regarding checking the internal consistency of the parameters. |
seqCluster is not meant to be called by the user. It is exported only so that its arguments can be clearly documented; these arguments can be passed via the argument seqArgs in functions like clusterSingle and clusterMany.
This code is adapted from the sequential portion of the code of the tightClust package of Tseng and Wong. At each iteration, the algorithm finds a set of samples that constitute a homogeneous cluster, removes them, and iterates again to find the next set of samples that form a cluster.
In each iteration, to determine the next homogeneous set of samples, the algorithm will iteratively cluster the current set of samples for a series of increasing values of the parameter $K$, starting at a value kinit and increasing by 1 at each iteration, until a sufficiently homogeneous set of clusters is found. For the first set of homogeneous samples, kinit is set to the argument k0; for each later set, kinit is adjusted internally (see the description of the kinit reset below).
How the value of $K$ is used depends on the value of subsample. If subsample=TRUE, $K$ is the k sent to the cluster function clusterFunction that is passed to subsampleClustering via subsampleArgs; mainClustering is then run on the co-occurrence matrix returned by subsampleClustering, using the ClusterFunction object defined in the argument clusterFunction set via mainClusterArgs. The number of clusters actually resulting from this run of mainClustering may not be equal to the $K$ sent to the clustering done in subsampleClustering.
If subsample=FALSE, mainClustering is called directly on the data to determine the clusters, and the $K$ set by seqCluster for this iteration determines the parameter of the clustering done by mainClustering. Specifically, the argument clusterFunction defines the clustering of the mainClustering step and k is sent to that ClusterFunction object. This means that if subsample=FALSE, the clusterFunction must be of algorithmType "K".
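The subsampling idea behind the co-occurrence matrix can be illustrated with a minimal base-R sketch. It assumes samples are rows and uses stats::kmeans; the function cooccurrence_sketch and its arguments n_resamples and sample_prop are hypothetical names for illustration only and are not the subsampleClustering implementation.

```r
# Hypothetical helper (not part of the package): for each pair of samples,
# the proportion of subsamples in which both were drawn AND k-means placed
# them in the same cluster.
cooccurrence_sketch <- function(x, k, n_resamples = 20, sample_prop = 0.7) {
  n <- nrow(x)                  # assumption: samples are rows in this sketch
  together <- matrix(0, n, n)   # times a pair was clustered together
  sampled  <- matrix(0, n, n)   # times a pair was drawn in the same subsample
  for (b in seq_len(n_resamples)) {
    idx <- sample(n, size = floor(sample_prop * n))
    cl  <- stats::kmeans(x[idx, , drop = FALSE], centers = k, nstart = 5)$cluster
    same <- outer(cl, cl, "==")
    together[idx, idx] <- together[idx, idx] + same
    sampled[idx, idx]  <- sampled[idx, idx] + 1
  }
  together / sampled            # NaN where a pair was never drawn together
}

# Two well-separated groups of 20 samples each.
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 6), ncol = 2))
D <- cooccurrence_sketch(x, k = 2)
```

In this sketch, a second clustering step would then be applied to D (or 1 - D as a dissimilarity), which is why the final number of clusters need not equal the k used inside the subsampling.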
In either setting of subsample, the resulting clusters from mainClustering for a particular $K$ will be compared to clusters found in the previous iteration of $K-1$. For computational convenience, only the first top.can clusters of each iteration will be compared to the first top.can clusters of the previous iteration for similarity (where top.can currently refers to ordering by size, i.e. the top.can largest clusters).
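Selecting the candidate clusters by size could look like the following sketch. It assumes top.can is given as a whole number of clusters and that cluster labels are stored in a plain vector; top_candidates is a hypothetical helper, not a package function.

```r
# Hypothetical helper: return the sample indices of the `top.can` largest
# clusters, largest first. Assumes `top.can` is a whole number of clusters.
top_candidates <- function(labels, top.can) {
  sizes <- sort(table(labels), decreasing = TRUE)
  keep  <- names(sizes)[seq_len(min(top.can, length(sizes)))]
  lapply(keep, function(id) which(labels == id))
}

labels <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4)
cands  <- top_candidates(labels, top.can = 2)
# cands[[1]] holds the samples of cluster 3 (size 4),
# cands[[2]] holds the samples of cluster 1 (size 3).
```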
If there is no cluster of the first top.can in the current iteration $K$ that has overlap similarity > beta to any in the previous iteration, then the algorithm will move to the next iteration, increasing to $K+1$.
If, however, of these clusters there is a cluster in the current iteration $K$ that has overlap similarity > beta to a cluster in the previous iteration $K-1$, then the cluster with the largest such similarity will be identified as a homogeneous set of samples and the samples in it will be removed and designated as such. The algorithm will then start again to determine the next set of homogeneous samples, but without these samples. Furthermore, in this case (i.e. a cluster was found and removed), the value of kinit will be reset to kinit-1; i.e. the range of increasing $K$ that will be iterated over to find a set of homogeneous samples will start off one value less than was the case for the previous set of homogeneous samples. If kinit-1 < k.min, then kinit will be set to k.min.
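The comparison between iterations $K-1$ and $K$, and the reset of kinit after a cluster is found, can be sketched as follows. This is an illustration under stated assumptions: the overlap similarity is taken to be the Jaccard index (intersection size over union size), which the text above does not specify, and overlap_similarity, best_overlap and next_kinit are hypothetical helper names.

```r
# Assumption: overlap similarity is the Jaccard index |A n B| / |A u B|.
overlap_similarity <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Hypothetical helper: among candidate clusters of the current iteration (K),
# find the one with the largest overlap (> beta) to any candidate cluster of
# the previous iteration (K - 1). Clusters are vectors of sample indices.
best_overlap <- function(current, previous, beta) {
  best <- list(found = FALSE, similarity = 0, cluster = NULL)
  for (a in current) {
    for (b in previous) {
      s <- overlap_similarity(a, b)
      if (s > beta && s > best$similarity) {
        best <- list(found = TRUE, similarity = s, cluster = a)
      }
    }
  }
  best
}

# After a cluster is found, kinit drops by one but never below k.min.
next_kinit <- function(kinit, k.min = 3) max(kinit - 1, k.min)

previous <- list(1:10, 11:20)          # candidates from iteration K - 1
current  <- list(1:9, 10:15, 16:20)    # candidates from iteration K
res <- best_overlap(current, previous, beta = 0.8)
```

Here the cluster 1:9 overlaps 1:10 with similarity 9/10, which exceeds beta = 0.8, so it would be removed and the search restarted from next_kinit(kinit, k.min). With a stricter beta such as 0.95, no pair qualifies and $K$ would be increased instead.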
If there are fewer than remain.n samples left after finding a cluster and removing its samples, the algorithm will stop, as subsampling is deemed to no longer be appropriate. If $K$ has to be increased beyond k.max without finding any pair of clusters with overlap > beta, then the algorithm will stop. Any samples not found as part of a homogeneous set of clusters at that point will be classified as unclustered (given a value of -1).
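The two stopping conditions and the final labelling can be sketched as below. The helper names why_stop and label_samples and the returned message strings are hypothetical; the actual whyStop text produced by seqCluster may differ.

```r
# Hypothetical helper: return a reason to stop, or NULL to keep going.
why_stop <- function(n_remaining, K, remain.n = 30, k.max) {
  if (n_remaining < remain.n) return("fewer than remain.n samples remain")
  if (K > k.max) return("K increased beyond k.max")
  NULL
}

# Hypothetical helper: number clusters in the order they were found and mark
# every sample that was never assigned to a cluster with -1.
label_samples <- function(n, found) {
  clustering <- rep(-1L, n)
  for (i in seq_along(found)) clustering[found[[i]]] <- i
  clustering
}

why_stop(n_remaining = 25, K = 6, k.max = 15)    # stops: too few samples
why_stop(n_remaining = 100, K = 16, k.max = 15)  # stops: K beyond k.max
labs <- label_samples(6, found = list(c(2, 3), 5))
```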
Certain combinations of inputs to mainClusterArgs and subsampleArgs are not allowed. See clusterSingle for details.
A list with values
clustering
a vector of length equal to nrows(x) giving the
integer-valued cluster ids for each sample. The integer values are assigned
in the order that the clusters were found. "-1" indicates the sample was not
clustered.
clusterInfo
if clusters were successfully found, a matrix of
information regarding the algorithm behavior for each cluster (the starting
and stopping K for each cluster, and the number of iterations for each
cluster).
whyStop
a character string explaining what triggered the
algorithm to stop.
Tseng and Wong (2005), "Tight Clustering: A Resampling-Based Approach for Identifying Stable and Tight Patterns in Data", Biometrics, 61:10-16.
tight.clust, clusterSingle, mainClustering, subsampleClustering
## Not run:
data(simData)
set.seed(12908)
clustSeqHier <- seqCluster(simData, inputType="X", k0=5, subsample=TRUE,
  beta=0.8, subsampleArgs=list(resamp.n=100,
  samp.p=0.7, clusterFunction="kmeans", clusterArgs=list(nstart=10)),
  mainClusterArgs=list(minSize=5, clusterFunction="hierarchical01",
  clusterArgs=list(alpha=0.1)))
## End(Not run)