library(knitr)
knitr::opts_chunk$set(
    error = FALSE,
    tidy = FALSE,
    message = FALSE,
    warning = FALSE,
    fig.align = "center",
    dev = "jpeg"
)
options(width = 100)
The simplifyEnrichment package clusters functional terms into groups by clustering their similarity matrix with a newly proposed method, "binary cut". The method recursively applies partitioning around medoids (PAM) with two groups to the similarity matrix; at each iteration, a score is assigned to decide whether the group of gene sets corresponding to the current sub-matrix should be split further. For more details of the method, please refer to the simplifyEnrichment paper.
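The recursion described above can be sketched in base R. This is an illustration only, not the package's implementation: hclust()/cutree() stands in for PAM so that the sketch needs no extra packages, and the score function and the 0.85 cutoff are assumptions chosen for the example. The names split_score() and recursive_split() are hypothetical.

```r
# Hypothetical score: how much stronger the within-group similarity is than
# the between-group similarity for a two-way split (0.5 = no structure).
split_score = function(mat, grp) {
    i1 = which(grp == 1)
    i2 = which(grp == 2)
    within_median = function(idx) {
        if (length(idx) < 2) return(1)  # a single term is fully self-similar
        sub = mat[idx, idx, drop = FALSE]
        median(sub[upper.tri(sub)])
    }
    s11 = within_median(i1)
    s22 = within_median(i2)
    s12 = median(mat[i1, i2])
    (s11 + s22) / (s11 + s22 + 2 * s12)
}

# Recursively split the terms in `idx` until the score drops below `cutoff`.
recursive_split = function(mat, cutoff = 0.85, idx = seq_len(nrow(mat))) {
    if (length(idx) < 2) return(list(idx))
    sub = mat[idx, idx, drop = FALSE]
    # Stand-in for PAM with two groups: hierarchical clustering cut at k = 2
    grp = cutree(hclust(as.dist(1 - sub)), k = 2)
    if (split_score(sub, grp) < cutoff) return(list(idx))  # keep as one cluster
    c(recursive_split(mat, cutoff, idx[grp == 1]),
      recursive_split(mat, cutoff, idx[grp == 2]))
}

# Toy similarity matrix with two obvious blocks of three terms each
toy = matrix(0.1, 6, 6)
toy[1:3, 1:3] = 0.9
toy[4:6, 4:6] = 0.9
diag(toy) = 1
clusters = recursive_split(toy)
```

On this toy matrix the first split separates the two blocks, and neither block is split again because its internal similarities are uniform.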
library(simplifyEnrichment)
mat = readRDS(system.file("extdata", "random_GO_BP_sim_mat.rds", package = "simplifyEnrichment"))
go_id = rownames(mat)
The major use case for simplifyEnrichment is simplifying GO enrichment results by clustering the semantic similarity matrix of the significant GO terms. To demonstrate the usage, we first generate a list of random GO IDs from the Biological Process (BP) ontology category:
library(simplifyEnrichment)
set.seed(888)
go_id = random_GO(500)
simplifyEnrichment starts with the GO similarity matrix. Users can supply their own similarity matrices or use the GO_similarity() function to calculate the semantic similarity matrix. The GO_similarity() function is simply a wrapper around GOSemSim::termSim(). The function accepts a vector of GO IDs. Note that all the GO terms should belong to the same ontology (i.e., BP, CC or MF).
mat = GO_similarity(go_id)
By default, GO_similarity() uses the Rel method in GOSemSim::termSim(). Other methods to calculate GO similarities can be set with the measure argument, e.g.:
GO_similarity(go_id, measure = "Wang")
With the similarity matrix mat, users can directly apply the simplifyGO() function to perform the clustering and visualize the results.
df = simplifyGO(mat)
On the right side of the heatmap are word cloud annotations that summarize the functions with keywords in every GO cluster. Note that there is no word cloud for the cluster that is merged from small clusters (size < 5).
The returned variable df is a data frame with GO IDs, GO terms and the cluster labels:
head(df)
The size of GO clusters can be retrieved by:
sort(table(df$cluster))
Or split the data frame by the cluster labels:
split(df, df$cluster)
The plot argument can be set to FALSE in simplifyGO(), so that no plot is generated and only the data frame is returned.
If the aim is only to cluster GO terms, the binary_cut() or cluster_terms() functions can be applied directly:
binary_cut(mat)
or
cluster_terms(mat, method = "binary_cut")
binary_cut() and cluster_terms() basically generate the same clusterings, but the labels of clusters might differ.
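Because the labels can differ even when the grouping is the same, comparing two label vectors element by element is misleading. The following base R sketch checks whether two labelings describe the same partition; same_partition() is a hypothetical helper, not a function of the package, and the toy vectors stand in for the outputs of the two functions, assuming each returns one label per term.

```r
# Two label vectors describe the same clustering if every label in `a`
# maps to exactly one label in `b` and vice versa.
same_partition = function(a, b) {
    stopifnot(length(a) == length(b))
    tb = table(a, b)
    all(rowSums(tb > 0) == 1) && all(colSums(tb > 0) == 1)
}

cl1 = c(1, 1, 2, 2, 3)
cl2 = c(3, 3, 1, 1, 2)   # same groups as cl1, different labels
cl3 = c(1, 1, 1, 2, 3)   # a genuinely different grouping

res_same = same_partition(cl1, cl2)
res_diff = same_partition(cl1, cl3)
```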
Semantic measures can be used for the similarity of GO terms. However, there are still many ontologies (e.g. MsigDB gene sets) that are represented only as lists of genes, where the similarity between gene sets is mainly measured by gene overlap. simplifyEnrichment provides term_similarity() and other related functions (term_similarity_from_enrichResult(), term_similarity_from_KEGG(), term_similarity_from_Reactome(), term_similarity_from_MSigDB() and term_similarity_from_gmt()), which calculate the similarity of terms from the gene overlap using the Jaccard coefficient, the Dice coefficient, the overlap coefficient or the kappa coefficient.
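To make the four coefficients concrete, here is a minimal base R sketch that computes them for two gene sets. It is not the package's code: overlap_similarity() is a hypothetical helper, the gene sets are arbitrary, and the kappa coefficient is computed as Cohen's kappa on in/out membership over an assumed background universe, which may differ from how the package defines it.

```r
overlap_similarity = function(a, b, universe = union(a, b)) {
    a = unique(a)
    b = unique(b)
    n_ab = length(intersect(a, b))   # genes shared by both sets
    n_union = length(union(a, b))
    n_a = length(a)
    n_b = length(b)
    n = length(unique(universe))     # size of the background universe
    # Cohen's kappa on the two membership vectors over the universe
    p_obs = (n_ab + (n - n_union)) / n
    p_exp = (n_a / n) * (n_b / n) + ((n - n_a) / n) * ((n - n_b) / n)
    c(jaccard = n_ab / n_union,
      dice    = 2 * n_ab / (n_a + n_b),
      overlap = n_ab / min(n_a, n_b),
      kappa   = (p_obs - p_exp) / (1 - p_exp))
}

set_a = c("TP53", "BRCA1", "ATM", "CHEK2")
set_b = c("TP53", "BRCA1", "MDM2")
background = c(set_a, set_b, "EGFR", "KRAS", "MYC", "PTEN", "RB1")
sim = overlap_similarity(set_a, set_b, universe = background)
```

Note that the kappa coefficient depends on the chosen universe, while the other three depend only on the two sets themselves.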
The similarity can be calculated by providing:
An enrichResult object, which normally comes from the 'clusterProfiler', 'DOSE', 'meshes' or 'ReactomePA' package.
Once you have the similarity matrix, you can pass it to the simplifyEnrichment() function.
But note, as we benchmarked in the manuscript, the clustering on the gene
overlap similarity performs much worse than on the semantic similarity.
The simplifyEnrichment package also provides functions that compare clustering results from different methods. Here we reuse the previously generated variable mat, which is the similarity matrix of the 500 random GO terms. Simply running the compare_clustering_methods() function performs all supported methods (listed by all_clustering_methods()) except mclust, because mclust usually takes a very long time to run. The function generates a figure with three panels:
In the barplots, the three metrics are defined as follows:
compare_clustering_methods(mat)
If the plot_type argument is set to heatmap, the function draws heatmaps of the similarity matrix under the different clustering methods. The last panel is a table with the number of clusters.
compare_clustering_methods(mat, plot_type = "heatmap")
Please note that the clustering methods may involve randomness, which means that different runs of compare_clustering_methods() may generate slightly different clusterings. Thus, if users want to compare the plots from compare_clustering_methods(mat) and compare_clustering_methods(mat, plot_type = "heatmap"), they should set the same random seed before each call.
set.seed(123)
compare_clustering_methods(mat)
set.seed(123)
compare_clustering_methods(mat, plot_type = "heatmap")
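The effect of resetting the seed can be seen with any randomized clustering method. In the following base R sketch, kmeans() is only a stand-in for a method with random initialization, and the toy data are simulated; setting the same seed before each run makes the two label vectors identical.

```r
# Simulated toy data: two groups of 20 points in two dimensions
set.seed(1)
toy = rbind(matrix(rnorm(40, mean = 0), ncol = 2),
            matrix(rnorm(40, mean = 3), ncol = 2))

# Same seed before each run, so the random starting centers are the same
set.seed(123)
run1 = kmeans(toy, centers = 2)$cluster
set.seed(123)
run2 = kmeans(toy, centers = 2)$cluster

same_result = identical(run1, run2)
```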
compare_clustering_methods() is simply a wrapper around the cmp_make_clusters() and cmp_make_plot() functions, where the former performs the clustering with the different methods and the latter visualizes the results. Because the clusterings are computed only once, users can also use the following code to compare the different plots without specifying the random seed.
clt = cmp_make_clusters(mat) # just a list of cluster labels
cmp_make_plot(mat, clt)
cmp_make_plot(mat, clt, plot_type = "heatmap")
New clustering methods can be added by register_clustering_methods(), removed by remove_clustering_methods() and reset to the default methods by reset_clustering_methods(). All the supported methods can be retrieved by all_clustering_methods(). compare_clustering_methods() runs all the clustering methods in all_clustering_methods().
The new clustering methods should be provided as user-defined functions and passed to register_clustering_methods() as named arguments, e.g.:
register_clustering_methods(
    method1 = function(mat, ...) ...,
    method2 = function(mat, ...) ...,
    ...
)
The functions should accept at least one argument, which is the input matrix (mat in the above example). The second, optional argument should always be ... so that parameters for the clustering function can be passed through the control argument of cluster_terms() or simplifyGO(). If users forget to add ..., it is added internally.
Please note, the user-defined function should automatically identify the optimized number of clusters. The function should return a vector of cluster labels. Internally it is converted to numeric labels.
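As an illustration of these requirements, the following sketch defines a function that takes the similarity matrix, chooses the number of clusters on its own and returns one label per term. The method itself (average-linkage hierarchical clustering cut at the largest jump in merge heights) and the name gap_hclust are assumptions made for this example, not something shipped with the package. The registration call is guarded so that the sketch also runs when simplifyEnrichment is not installed.

```r
# Hypothetical user-defined method: hierarchical clustering on 1 - similarity,
# with the number of clusters picked at the largest gap between merge heights.
gap_hclust = function(mat, ...) {
    n = nrow(mat)
    if (n < 3) return(rep(1L, n))            # too few terms to look for a gap
    hc = hclust(as.dist(1 - mat), method = "average")
    gaps = diff(hc$height)                    # jumps between successive merges
    k = n - which.max(gaps)                   # stop merging before the biggest jump
    unname(cutree(hc, k = k))                 # one cluster label per term
}

# Toy similarity matrix with two obvious blocks of three terms each
toy = matrix(0.1, 6, 6)
toy[1:3, 1:3] = 0.9
toy[4:6, 4:6] = 0.9
diag(toy) = 1
labels = gap_hclust(toy)

# Register the method under the name "gap_hclust" if the package is available
if (requireNamespace("simplifyEnrichment", quietly = TRUE)) {
    simplifyEnrichment::register_clustering_methods(gap_hclust = gap_hclust)
}
```

Once registered, the method would be selected by name in the same way as the built-in ones, e.g. with the method argument of cluster_terms().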
The following examples are the ones we used for the benchmarking in the manuscript:
sessionInfo()