library(markdown) options(markdown.HTML.options = c(options('markdown.HTML.options')[[1]], "toc")) library(knitr) knitr::opts_chunk$set( error = FALSE, tidy = FALSE, message = FALSE, fig.align = "center") options(markdown.HTML.stylesheet = "custom.css") options(rmarkdown.html_vignette.check_title = FALSE) options(width = 100) library(ComplexHeatmap) library(circlize) library(cola)
Choosing the best k (number of partitions) is not an easy problem for consensus partitioning. In consensus partitioning, various metrics based on the consensus matrix are normally calculated, e.g. PAC scores (or 1-PAC) or mean silhouette scores, and the best k is normally selected based on the "extremal method", i.e. to select the k that corresponds to the highest or lowest values of the metrics. When the number of partitions is small, it is relatively easy to determine the best k with high confidence, while when the real number of clusters gets very large, it is difficult to identify the correct or approximate k for several reasons, some of which we list in the following: 1) Variation in the "big clusters" affect the eCDF of the consensus matrix stronger than variation in the "small clusters", this can strongly affect PAC scores. 2) Groups showing weaker differences (we can call them secondary groups) are more difficult to separate especially when there are already other groups showing distinct differences (we can call them major groups). 3) The curve of various metrics against k gets flattened for large k and the value of k with the extremal values will be less distinct.
The following four figures illustrate the eCDF curves of a consensus matrix, 1-PAC, mean silhouette and concordance scores for different k where k ranges from 2 to 10 (from the analysis here). Basically, when k >= 5, the eCDF curves have a long plateau with less and less curvature, i.e. they lose their step-like shape, which results in 1-PAC getting almost stable for k >= 5. Also for the curves of mean silhouette and concordance scores against k, they are almost flattened for k >= 3. If using the "extremal method", k = 6 is taken as the best k because 1-PAC selects k = 9 while mean silhouette and concordance select k = 6.
When inspecting the consensus heatmaps for different k (cf. heatmaps below), actually it is difficult to assess whether the partitioning for k = 6 is better than any other k in [5, 10].
The problems of the "big clusters / small clusters" or "major clusters / secondary clusters" in selecting the best k are mainly due to the consensus partitioning procedure that all samples are taken into account equally. From version 2.0.0, we proposed a new method which tries to solve this issue by applying consensus partitioning in a hierarchical way. Simply speaking, one could first classify the samples into nmajor groups (nmajor is a small number, major clusters), then for each subgroup of samples, one could repeatedly apply consensus clustering. By this means, theoretically, small clusters or secondary clusters could be detected in later steps of the hierarchical procedure.
The figure below illustrates the workflow of the hierarchical consensus partitioning.
The steps are:
The process of the hierarchical consensus partitioning is saved as a dendrogram internally.
In this section, we demonstrate the functionalities of hierarchical consensus
partitioning. The design of these new functionalities tries to be as
consistent as the functions for normal consensus partitioning in cola,
i.e., the function consensus_partition()
or
run_all_consensus_partition_methods()
. Thus, you may find many functions
having the same names as the functions for normal consensus partitioning.
The following code applies hierarchical consensus partitioning on the Golub
dataset. Function hierarchical_partition()
applies the analysis where the
main arguments are very similar as in consensus_partition()
or
run_all_consensus_partition_methods()
, which are the input matrix, the
sample annotations and the number of cores. The function returns a
HierarchicalPartition
object.
library(golubEsets) # from Bioconductor data(Golub_Merge) m = exprs(Golub_Merge) colnames(m) = paste0("sample_", colnames(m)) anno = pData(Golub_Merge) m[m <= 1] = NA m = log10(m) m = adjust_matrix(m) # apply quantile normalization library(preprocessCore) # from Bioconductor cn = colnames(m) rn = rownames(m) m = normalize.quantiles(m) colnames(m) = cn rownames(m) = rn set.seed(123) golub_cola_rh = hierarchical_partition( m, anno = anno[, c("ALL.AML"), drop = FALSE] )
Some important arguments in hierarchical_partition()
are listed as follows:
top_n
: Number of rows with top values. The value can be a vector of
integers. On each node, there is an additional filtering on rows of the
submatrices to remove those rows with very small variances, which results in
the reducing number of rows in the submatrices. Here top_n
can be set as a
vector of values less than 1 which are treated as the fraction of rows of
every submatrix.top_value_method
: A single or a vector of top-value methods.partition_method
: A single or a vector of partitioning methods. All
combinations of top_value_method
and partition_method
are tried.combination_method
: Instead of specifying top_value_method
and
partition_method
, all methods that are going to be tried can also be
specified with combination_method
. The value can be a vector in a form of
c("SD:hclust", "ATC:skmeans", ...)
or a data frame with two columns
data.frame(c("SD", "ATC"), c("hclust", "skmeans"))
.anno
: Annotation for the columns. The value can be a vector or a data
frame. Note the rows in anno
should be corresponded to the matrix columns.anno_col
: Colors for annotations. The value should be a list of colors
where a named vector for discrete color mapping and a color mapping function
generated by circlize::colorRamp2()
for continuous color mapping.mean_silhouette_cutoff
: Cut off for mean silhouette_cutoff score to decide whether a partition
is stable and to be split further more.min_samples
: Miminal number of samples to perform partitioning.subset
: Number of subset columns to perform partitioning. If the current
number of columns in the submatrix is higher than subset
,
consensus_partition_by_down_sampling()
instead of consensus_partition()
will be applied. It will be discussed in more details in the Section Work
with huge datasets.min_n_signatures
: On each node that has a partition, get_signatures()
is
applied to find the number of signatures under the best k. If the number of
signatures is less than min_n_signatures
, it means the partitioning might
not be significantly different and the hierarchical consensus partitioning
stops.min_p_signatures
: This is the fraction of signatures to the total number
of rows of the original matrix. This filtering is a companion of
min_n_signatures
.max_k
: Maximal number of groups to try for consensus partitioning.
Normally this value should be set to a small value, because more subgroups
will be found during the hierarchical consensus partitioning process.cores
: hierarchical_partition()
supports parallel computing. This is
the number of cores to use.The object golub_cola_rh
is already generated and shipped in cola package, so we directly load it.
data(golub_cola_rh)
golub_cola_rh
Directly entering golub_cola_rh
prints the hierarchy. As you already see in
the previous output, the node in the hierarchy is encoded in a special way. As
explained in previous text, on each node, the columns are split into several
groups and a digit (the subgroup index in current partitioning) is appended with the current node label to the
children node labels. Thus, the length (or nchar
) of the label represents
the depth of that node in the hierarchy and from the node label, it is also
straightforward to infer its parent node. E.g., a node with label 0123
has its
parent node 012
.
Also you can find the functions that can be applied to the HierarchicalPartition
object
in previous output.
The first function you may try is to see how the columns are separated and the hierarchy
of the subgroups. This can be done by collect_classes()
function:
collect_classes(golub_cola_rh)
There are several metrics saved for each node which can be retrieved by node_info()
.
node_info(golub_cola_rh)
There are following columns from node_info()
:
id
: The node id or label.best_method
: Because on each node, multiple methods set in combination_method
are tried
and the method that gives the highest 1-PAC under its best k is finally selected. The best
method is saved here.depth
: The depth of the node in the hierarchy.best_k
: The best k used.n_columns
: Number of columns of the submatrix.n_signatures
: Number of signatures.p_signatures
: Fraction of signatures to the total number of rows of the matrix.is_leaf
: Whether the node is a leaf node.These values are useful to merge the children nodes.
Most functions for dealing with the HierarchicalPartition
object accept a merge_node
argument,
where you can set different paremeters to select the children node to merge. These parameters
should be set by the function merge_node_param()
function. And there are the four parameters
can be adjusted:
depth
min_n_signatures
min_p_signatures
min_n_signatures
measures how much you trust the classification on the bioloigcal point of view.
In the following, we demonstrate to manuplate the dendrogram by
setting different min_n_signatures
values.
collect_classes(golub_cola_rh, merge_node = merge_node_param(min_n_signatures = 174)) collect_classes(golub_cola_rh, merge_node = merge_node_param(min_n_signatures = 263)) collect_classes(golub_cola_rh, merge_node = merge_node_param(min_n_signatures = 2041))
p1 = grid.grabExpr(collect_classes(golub_cola_rh, merge_node = merge_node_param(min_n_signatures = 174), row_title = "min_n_signatures >= 174")) p2 = grid.grabExpr(collect_classes(golub_cola_rh, merge_node = merge_node_param(min_n_signatures = 263), row_title = "min_n_signatures >= 263")) p3= grid.grabExpr(collect_classes(golub_cola_rh, merge_node = merge_node_param(min_n_signatures = 2041), row_title = "min_n_signatures >= 2041")) grid.newpage() pushViewport(viewport(x = 0, width = 1/3, just = "left")) grid.draw(p1) popViewport() pushViewport(viewport(x = 1/3, width = 1/3, just = "left")) grid.draw(p2) popViewport() pushViewport(viewport(x = 2/3, width = 1/3, just = "left")) grid.draw(p3) popViewport()
We can also compare to the normal consensus partitioning classification:
data(golub_cola) golub_cola_cp = golub_cola["ATC:skmeans"] collect_classes(golub_cola_rh, anno = cbind(get_anno(golub_cola_rh), cola_cp = factor(get_classes(golub_cola_cp, k = suggest_best_k(golub_cola_cp))[, "class"])), anno_col = c(get_anno_col(golub_cola_rh)) )
get_classes()
returns the subgroups of columns:
get_classes(golub_cola_rh)
Also you can control merge_node
argument to decide on which level of hierarchy you want.
get_classes(golub_cola_rh, merge_node = merge_node_param(min_n_signatures = 263))
suggest_best_k()
extracts the the best k as well as related metrics for the
best partitions on each node.
suggest_best_k(golub_cola_rh)
Another important function which gives a direct feeling of how the subgrouping look like
is to check the signatures that are significantly different between subgroups.
Similarly as normal consensus partitioning, we can use get_signatures()
here.
The function basically applies get_signatures,ConsesusPartition-method()
on the partition
on every node and collect all the signatures as the signatures of the hierarchical
consensus partitioning.
get_signatures(golub_cola_rh, verbose = FALSE)
Other useful functions are dimension_reduction()
, compare_signatures()
and
test_to_known_factors()
. The usages are as follows:
dimension_reduction(golub_cola_rh)
compare_signatures(golub_cola_rh)
test_to_known_factors(golub_cola_rh)
Note all these functions mentioned above allow the merge_node
argument to adjust
the hierarchy.
One of the key advantage of cola package is it automates the complete analysis.
There is also a cola_report()
function for HierarchicalPartition
class and it
automates the complete analysis as well. Simply run:
rh = hierarchical_partition(...) cola_report(rh, output_dir = ...)
In the vignette "Work with Big Datasets", we introduced
a DownSamplingConsensusPartition
class and its corresponding method consensus_partition_by_down_sampling()
which performs consensus partitioning on a randomly sampled subset of columns and predict the subgroup
labels for the remaining columns from the ... Here hierarchical_partition()
also supports down sampling
which makes it possible to work on extremly large datasets.
The only thing for dealing with huge datasets is to set the subset
argument.
hierarchical_partition(..., subset = ...)
On each node, to consider the euqal sizes of groups, we first perform a fast k-means and random sample columns with different weight according to the size the groups.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.