#' Perform optimization and penalized K-means clustering
#'
#'
#' This is the central function of the package. As input, only a dataset is
#' required. It starts by optimizing the penalty term and then performs the
#' clustering based on the values identified in that optimization step.
#' @importFrom grDevices col2rgb colorRampPalette densCols dev.off palette pdf
#' png
#' @importFrom graphics axis contour hist image legend mtext par plot plot.new
#' text box
#' @importFrom stats median p.adjust predict quantile rnorm sd var wilcox.test
#' runif
#' @importFrom utils write.csv tail
#' @importFrom methods is
#' @importFrom ggplot2 ggplot aes geom_line ggtitle xlab ylab ylim ggsave
#' @importFrom dplyr sample_n
#' @importFrom parallel detectCores makeCluster stopCluster
#' @importFrom doSNOW registerDoSNOW
#' @importFrom foreach foreach %dopar%
#' @param inDataFrame A dataframe or matrix with the data that will be used to
#' create the clustering. Cytometry data should be transformed using
#' biexponential, arcsinh transformation or similar, and day-to-day
#' normalizations should be performed for all data if not all of it has been
#' acquired on the same run. Scaling, on the other hand, is performed within
#' the function.
#' @param samplingSubset If the dataset is made up of unequal numbers of cells
#' from multiple individuals, it might be wise to pre-define a subset of the
#' rows that includes equal or near-equal numbers of cells from each
#' individual, to prevent a few outliers from dominating the analysis. This
#' can be done here. Should be a vector of row numbers in the inDataFrame;
#' see the examples below for a sketch of how such a subset can be built.
#' @param penalties This argument decides whether a single penalty will be used
#' for clustering, or if multiple penalties will be evaluated to identify the
#' optimal one. A single value, a vector of values, or, if dual clustering is
#' performed, a list of two vectors can be given here. The suggested default
#' values are empirically defined and might not be optimal for a specific
#' dataset, but the algorithm will warn if the optimal values are on the
#' borders of the range. Note that when the penalty is 0, there is no
#' penalization, which means that the algorithm runs standard K-means
#' clustering (see the examples below).
#' @param sampleSize This controls the fraction of the dataset that will be
#' used to run the penalty optimization. 'default' uses the full dataset for
#' datasets of up to 10000 events; for larger datasets, a random subset of
#' 10000 events is drawn. A user-specified number is also accepted.
#' @param selectionSampleSize The size of the dataset used to find the optimal
#' solution out of the many generated by the penalty optimization at each
#' sample size. 'default' sets this to the same size as sampleSize. A
#' user-specified number is also accepted.
#' @param k Number of initial cluster centers. The higher the number, the
#' greater the precision of the clustering, but the computing time also
#' increases linearly with the number of starting points. Default is 30. If
#' penalties = 0, standard K-means clustering with k clusters will be
#' performed.
#' @param minARIImprovement This is the stop criterion for the penalty
#' optimization algorithm: the more iterations that are run, the smaller the
#' improvement of the adjusted Rand index (ARI) becomes, and this sets the
#' threshold at which the inner iterations stop. Defaults to 0.01.
#' @param maxIter The maximal number of iterations that are performed in the
#' penalty optimization.
#' @param log2Off If TRUE, the automatic detection of high kurtosis, and the
#' log2 transformation that follows from it, is turned off.
#' @param optimARI Above this level of ARI, all solutions are considered equally
#' valid, and the median solution is selected among them.
#' @param center If centering should be performed. Alternatives are 'default',
#' 'mean', 'peak', FALSE and a vector of numbers with the same length as the
#' number of columns in the inDataFrame. 'peak' results in centering around the
#' highest peak in the data, which is useful in most cytometry situations.
#' 'mean' results in mean centering. 'default' gives different results
#' depending on the data: datasets with 100+ variables are mean centered,
#' and otherwise, peak centering is used. If a numeric vector is provided, the
#' values are centered around these numbers. This is preferable to
#' pre-centering the data and setting center = FALSE, as it leads to better
#' internal visualization procedures, etc. FALSE results in no centering,
#' mainly for testing purposes.
#' @param scale If scaling should be performed. If TRUE, the dataset will be
#' divided by the combined standard deviation of the whole dataset. If a number
#' is provided, the dataset is divided by this number. This scaling
#' procedure makes the default penalties fit most datasets with some precision.
#' @param nCores The number of parallel processes used in the penalty
#' optimization. The default is 87.5 percent of the available cores, capped
#' at 10, as no speed increase is generally seen above 10 cores for normal
#' computers.
#' @param plotDir Where should the diagnostic plots be printed?
#' @param createOutput For testing purposes. Defaults to TRUE. If FALSE, no
#' plots are generated.
#' @return A nested list:
#' \describe{
#' \item{clusterVector}{A vector with the same length as number of rows in
#' the inDataFrame, where the cluster identity of each observation is
#' noted.}
#' \item{clusterCenters}{A matrix containing information
#' about where the centers are in all the variables that contributed to
#' creating the cluster with the given penalty term. An exact zero here
#' indicates that the variable in question was sparsed out for that cluster.
#' If a variable did not contribute to the separation of any cluster, it will
#' not be present here.}
#' \item{essenceElementList}{A per-cluster list of the items that were
#' used to separate that cluster from the rest, i.e. the items that survived
#' the penalty.}
#' \item{penaltyOptList}{A list of two dataframes:
#' \describe{
#' \item{penaltyOpt.df}{A one row dataframe with the settings for
#' the optimal penalty.}
#' \item{meanOptimDf}{A dataframe with the information about the
#' results with all tested penalty values.}
#' }
#' }
#' \item{logCenterSd}{The values used to center and scale the
#' data and information on whether the data was log transformed. This
#' information is used internally in dAllocate.}
#' }
#'
#' @examples
#' # Load some data
#' data(testData)
#'
#' # Here is a run with the standard settings
#' \dontrun{
#' testDataDepecheResult <- depeche(testData[, 2:15])
#'
#' # Look at the result
#' str(testDataDepecheResult)
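#'
#' # With penalties = 0, no penalization is applied, so depeche reduces to
#' # standard K-means clustering with k cluster centers:
#' testDataKMeansResult <- depeche(testData[, 2:15], penalties = 0)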
#' }
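#'
#' # Below is a sketch of how a balanced samplingSubset could be constructed.
#' # It assumes, hypothetically, that the first column of testData holds a
#' # donor identity for each row; adapt it to the real layout of the data.
#' \dontrun{
#' donorIds <- testData[, 1]
#' nPerDonor <- min(table(donorIds))
#' # Draw the same number of rows from each donor, without replacement
#' balancedSubset <- unlist(lapply(
#'     split(seq_len(nrow(testData)), donorIds),
#'     sample, size = nPerDonor
#' ))
#' balancedResult <- depeche(testData[, 2:15], samplingSubset = balancedSubset)
#' }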
#'
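#' # For datasets with millions of rows, a preferred workflow is to build the
#' # model on a subset and then allocate the remaining events with dAllocate.
#' # This is only a sketch: it assumes that dAllocate accepts the depeche
#' # result as its model argument; consult ?dAllocate for the exact interface.
#' \dontrun{
#' subsetRows <- sample(seq_len(nrow(testData)), 1000)
#' subsetResult <- depeche(testData[subsetRows, 2:15])
#' fullClusterVector <- dAllocate(testData[, 2:15], subsetResult)
#' }
#'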
#' @export depeche
depeche <- function(inDataFrame, samplingSubset = seq_len(nrow(inDataFrame)),
penalties = 2^seq(0, 5, by = 0.5),
sampleSize = "default", selectionSampleSize = "default",
k = 30, minARIImprovement = 0.01, optimARI = 0.95,
maxIter = 100, log2Off = FALSE, center = "default",
scale = TRUE, nCores = "default", plotDir = ".",
createOutput = TRUE) {
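# Coerce matrix input to a data frame, as the downstream code expects
# data frame semantics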
if (is.matrix(inDataFrame)) {
inDataFrame <- as.data.frame.matrix(inDataFrame)
}
if (plotDir != "." && !dir.exists(plotDir)) {
dir.create(plotDir, recursive = TRUE)
}
logCenterSdResult <- depecheLogCenterSd(inDataFrame, log2Off,
center, scale)
inDataFrameScaled <- logCenterSdResult[[1]]
logCenterSd <- logCenterSdResult[[2]]
# Here, if the dataset is big, a subset of it is used to sample from. The
# samplingSubset argument also makes it possible to balance the dataset, so
# that each individual contributes an equal number of cells as input.
if (length(samplingSubset) > 1e+06) {
warning("This dataset is very large, and it is predicted to run for ",
"a considerable time, without clear benefits in terms of the ",
"result. A preferred option is to generate the model with a ",
"sampling subset and then allocate the remaining data points ",
"using dAllocate.")
}
inDataFrameUsed <- inDataFrameScaled[samplingSubset, ]
# Here, the sampleSize is set in case it is 'default'.
if (sampleSize == "default") {
if (nrow(inDataFrameUsed) <= 10000) {
sampleSize <- nrow(inDataFrameUsed)
} else {
sampleSize <- 10000
}
}
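# Here, the number of cores is set in case it is 'default': 87.5 percent
# of the detected cores, capped at 10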
if (nCores == "default") {
nCores <- floor(detectCores() * 0.875)
if (nCores > 10) {
nCores <- 10
}
}
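# Here, the penalty optimization is run, identifying the best penalty
# among the candidates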
penaltyOptResult <- depechePenaltyOpt(inDataFrameUsed,
k = k, maxIter = maxIter,
sampleSize = sampleSize,
penalties = penalties,
createOutput = createOutput,
minARIImprovement = minARIImprovement,
optimARI = optimARI, nCores = nCores,
plotDir = plotDir
)
# Now over to creating the final solution
# Here, the selectionDataSet is created
if (selectionSampleSize == "default") {
selectionSampleSize <- sampleSize
}
if (nrow(inDataFrameUsed) <= selectionSampleSize) {
selectionDataSet <- inDataFrameUsed
} else {
selectionDataSet <-
inDataFrameUsed[sample(seq_len(nrow(inDataFrameUsed)),
selectionSampleSize,
replace = TRUE
), ]
}
# If the dataset is small, a new set of seven clusterings is performed (on
# all the data or on a subsample, depending on the sample size), and the
# maximum likelihood solution is returned as the result
if (nrow(inDataFrameUsed) < 10000) {
penalty <- penaltyOptResult$bestPenalty
depecheAllDataResult <-
depecheAllData(inDataFrameUsed,
penalty = penalty, k = k,
nCores = nCores
)
clusterVector <- depecheAllDataResult[[1]]
reducedClusterCenters <- depecheAllDataResult[[2]]
} else {
# Now, the best run amongst all the runs with the largest sample size is
# selected, by identifying the solution that gives the highest mean
# F-measure against all the others.
reducedClusterCenters <-
depecheClusterCenterSelection(allSolutions = penaltyOptResult[[3]],
selectionDataSet = selectionDataSet,
k = k, nCores = nCores)
# And here, the optimal solution is created with the full dataset
clusterVector <-
dAllocate(inDataFrameScaled, depModel =
list("clusterCenters" =
reducedClusterCenters,
"logCenterSd" = FALSE))
}
# Now, the clusters are ordered according to their size, and re-numbered
clusterSizeNums <- as.numeric(names(sort(table(clusterVector),
decreasing = TRUE)))
nClust <- length(clusterSizeNums)
newNums <- seq(1, nClust)
clusterVectorNewNums <- clusterVector
sortedClustSizeNums <- sort(clusterSizeNums)
row.names(reducedClusterCenters) <- sortedClustSizeNums
for (i in seq_len(nClust)) {
localNum <- clusterSizeNums[i]
clusterVectorNewNums[clusterVector == localNum] <- newNums[i]
row.names(reducedClusterCenters)[sortedClustSizeNums == localNum] <-
newNums[i]
}
# And here, the rows of the reducedClusterCenters are reordered
reducedClusterCenters <- reducedClusterCenters[
order(as.numeric(row.names(reducedClusterCenters))),]
# Here, the optPenalty information is
# retrieved from the optimal sample size
# run.
optPenalty <- list("bestPenalty" = penaltyOptResult[[1]],
"allPenalties" = penaltyOptResult[[2]])
# Here, we generate a matrix showing the ground truth of where the cluster
# centers lie in the data, before all the normalization procedures.
groundClusterCenters <-
depecheGroundCenters(reducedClusterCenters, logCenterSd)
# Here, a list of the essential elements for each cluster is generated.
essenceElementList <-
lapply(seq_len(nrow(reducedClusterCenters)),
function(x) {
return(colnames(reducedClusterCenters)
[which(reducedClusterCenters[x, ] != 0)])
})
# Now, in applicable cases, we produce a heatmap of the cluster centers
depecheHeatMap(inDataFrameUsed, reducedClusterCenters,
plotDir, logCenterSd, createOutput)
# And now, the result that should be
# returned is compiled
depecheResult <- list(
"clusterVector" = clusterVectorNewNums,
"clusterCenters" = groundClusterCenters,
"essenceElementList" = essenceElementList,
"penaltyOptList" = optPenalty,
"logCenterSd" = logCenterSd
)
if (logCenterSd[[1]]) {
names(depecheResult)[2] <- "log2ClusterCenters"
}
depecheResult
}