BiocStyle::markdown() knitr::opts_chunk$set(tidy = FALSE, warning = FALSE, message = FALSE)
library(DOSE) library(GO.db) library(org.Hs.eg.db) library(topGO) library(GSEABase) library(clusterProfiler)
r Biocpkg("clusterProfiler")
implements methods to analyze and visualize functional profiles of genomic coordinates (supported by r Biocpkg("ChIPseeker")
), gene and gene clusters.
If you use r Biocpkg("clusterProfiler")
in published research, please cite:
G Yu, LG Wang, Y Han, QY He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 2012, 16(5):284-287. doi:10.1089/omi.2011.0118
In recently years, high-throughput experimental techniques such as microarray, RNA-Seq and mass spectrometry can detect cellular molecules at systems-level. These kinds of analyses generate huge quantitaties of data, which need to be given a biological interpretation. A commonly used approach is via clustering in the gene dimension for grouping different genes based on their similarities[@yu2010].
To search for shared functions among genes, a common way is to incorporate the biological knowledge, such as Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG), for identifying predominant biological themes of a collection of genes.
After clustering analysis, researchers not only want to determine
whether there is a common theme of a particular gene cluster, but also
to compare the biological themes among gene clusters. The manual step
to choose interesting clusters followed by enrichment analysis on each
selected cluster is slow and tedious. To bridge this gap, we designed
r Biocpkg("clusterProfiler")
[@yu2012], for comparing and visualizing functional
profiles among gene clusters.
bitr
: Biological Id TranslatoRr Biocpkg("clusterProfiler")
provides bitr
and bitr_kegg
for converting ID types. Both bitr
and bitr_kegg
support many species including model and many non-model organisms.
x <- c("GPX3", "GLRX", "LBP", "CRYAB", "DEFB1", "HCLS1", "SOD2", "HSPA2", "ORM1", "IGFBP1", "PTHLH", "GPC3", "IGFBP3","TOB1", "MITF", "NDRG1", "NR1H4", "FGFR3", "PVR", "IL6", "PTPRM", "ERBB2", "NID2", "LAMB1", "COMP", "PLS3", "MCAM", "SPP1", "LAMC1", "COL4A2", "COL4A1", "MYOC", "ANXA4", "TFPI2", "CST6", "SLPI", "TIMP2", "CPM", "GGT1", "NNMT", "MAL", "EEF1A2", "HGD", "TCN2", "CDA", "PCCA", "CRYM", "PDXK", "STC1", "WARS", "HMOX1", "FXYD2", "RBP4", "SLC6A12", "KDELR3", "ITM2B") eg = bitr(x, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Hs.eg.db") head(eg)
User should provides an annotation package, both fromType and toType can accept any types that supported.
User can use keytypes to list all supporting types.
library(org.Hs.eg.db) keytypes(org.Hs.eg.db)
We can translate from one type to other types.
ids <- bitr(x, fromType="SYMBOL", toType=c("UNIPROT", "ENSEMBL"), OrgDb="org.Hs.eg.db") head(ids)
For GO analysis, user don't need to convert ID, all ID type provided by OrgDb
can be used in groupGO
, enrichGO
and gseGO
by specifying keytype
parameter.
bitr_kegg
: converting biological IDs using KEGG APIdata(gcSample) hg <- gcSample[[1]] head(hg) eg2np <- bitr_kegg(hg, fromType='kegg', toType='ncbi-proteinid', organism='hsa') head(eg2np)
The ID type (both fromType
& toType
) should be one of 'kegg', 'ncbi-geneid', 'ncbi-proteinid' or 'uniprot'. The 'kegg' is the primary ID used in KEGG database. The data source of KEGG was from NCBI. A rule of thumb for the 'kegg' ID is entrezgene
ID for eukaryote species and Locus
ID for prokaryotes.
Many prokaryote species don't have entrezgene ID available. For example we can check the gene information of ece:Z5100
in http://www.genome.jp/dbget-bin/www_bget?ece:Z5100, which have NCBI-ProteinID
and UnitProt
links in the Other DBs
Entry, but not NCBI-GeneID
.
If we try to convert Z5100
to ncbi-geneid
, bitr_kegg
will throw error of ncbi-geneid is not supported
.
bitr_kegg("Z5100", fromType="kegg", toType='ncbi-geneid', organism='ece')
## Error in KEGG_convert(fromType, toType, organism) : ## ncbi-geneid is not supported for ece ...
We can of course convert it to ncbi-proteinid
and uniprot
:
bitr_kegg("Z5100", fromType="kegg", toType='ncbi-proteinid', organism='ece') bitr_kegg("Z5100", fromType="kegg", toType='uniprot', organism='ece')
GO analyses (groupGO()
, enrichGO()
and gseGO()
) support organisms that have an OrgDb
object available.
Bioconductor have already provide OrgDb
for about 20 species. User can query OrgDb
online by r Biocpkg("AnnotationHub")
or build their own by r Biocpkg("AnnotationForge")
. An example can be found in the vignette of r Biocpkg("GOSemSim")
.
If user have GO annotation data (in data.frame format with first column of gene ID and second column of GO ID), they can use enricher()
and gseGO()
functions to perform over-representation test and gene set enrichment analysis.
If genes are annotated by direction annotation, it should also annotated by its ancestor GO nodes (indirect annation). If user only has direct annotation, they can pass their annotation to buildGOmap
function, which will infer indirection annotation and generate a data.frame
that suitable for both enricher()
and gseGO()
.
In r Biocpkg("clusterProfiler")
, groupGO
is designed for gene classification based on GO distribution at a specific level. Here we use dataset geneList
provided by r Biocpkg("DOSE")
. Please refer to vignette of r Biocpkg("DOSE")
for more details.
data(geneList, package="DOSE") gene <- names(geneList)[abs(geneList) > 2] gene.df <- bitr(gene, fromType = "ENTREZID", toType = c("ENSEMBL", "SYMBOL"), OrgDb = org.Hs.eg.db) head(gene.df) ggo <- groupGO(gene = gene, OrgDb = org.Hs.eg.db, ont = "CC", level = 3, readable = TRUE) head(ggo)
The input parameters of gene is a vector of gene IDs (can be any ID type that supported by corresponding OrgDb
).
If readable is setting to TRUE, the input gene IDs will be converted to gene symbols.
Over-representation test[@boyle2004] were implemented in r Biocpkg("clusterProfiler")
. For calculation details and explanation of paramters, please refer to the vignette of r Biocpkg("DOSE")
.
ego <- enrichGO(gene = gene, universe = names(geneList), OrgDb = org.Hs.eg.db, ont = "CC", pAdjustMethod = "BH", pvalueCutoff = 0.01, qvalueCutoff = 0.05, readable = TRUE) head(ego)
As I mentioned before, any gene ID type that supported in OrgDb
can be directly used in GO analyses. User need to specify the keytype
parameter to specify the input gene ID type.
ego2 <- enrichGO(gene = gene.df$ENSEMBL, OrgDb = org.Hs.eg.db, keytype = 'ENSEMBL', ont = "CC", pAdjustMethod = "BH", pvalueCutoff = 0.01, qvalueCutoff = 0.05)
Gene ID can be mapped to gene Symbol by using paramter readable=TRUE
or setReadable
function.
ego2 <- setReadable(ego2, OrgDb = org.Hs.eg.db)
enrichGO
test the whole GO corpus and enriched result may contains very general terms. With dropGO
function, user can remove specific GO terms or GO level from results obtained from both enrichGO
and compareCluster
.
enrichGO
doesn't contain parameter to restrict the test at specific GO level. Instead, we provide a function gofilter
to restrict the result at specific GO level. It works with results obtained from both enrichGO
and compareCluster
.
According to issue #28, I implement a simplify
method to remove redundant GO terms obtained from enrichGO
. An example can be found in the blog post. It internally call r Biocpkg("GOSemSim")
to calculate similarities among GO terms and remove those highly similar terms by keeping one representative term. The simplify
method also works with both outputs from enrichGO
and compareCluster
.
A common approach in analyzing gene expression profiles was identifying differential expressed genes that are deemed interesting. The enrichment analysis we demonstrated previous were based on these differential expressed genes. This approach will find genes where the difference is large, but it will not detect a situation where the difference is small, but evidenced in coordinated way in a set of related genes. Gene Set Enrichment Analysis (GSEA)[@subramanian_gene_2005] directly addresses this limitation. All genes can be used in GSEA; GSEA aggregates the per gene statistics across genes within a gene set, therefore making it possible to detect situations where all genes in a predefined set change in a small but coordinated way. Since it is likely that many relevant phenotypic differences are manifested by small but consistent changes in a set of genes.
For algorithm details, please refer to the vignette of r Biocpkg("DOSE")
.
ego3 <- gseGO(geneList = geneList, OrgDb = org.Hs.eg.db, ont = "CC", nPerm = 1000, minGSSize = 100, maxGSSize = 500, pvalueCutoff = 0.05, verbose = FALSE)
GSEA use permutation test, user can set nPerm for number of permutations. Only gene Set size in [minGSSize, maxGSSize]
will be tested.
GO semantic similarity can be calculated by r Biocpkg("GOSemSim")
[@yu2010]. We can use it to cluster genes/proteins into different clusters based on their functional similarity and can also use it to measure the similarities among GO terms to reduce the redundancy of GO enrichment results.
The annotation package, r Biocannopkg("KEGG.db")
, is not updated since 2012. It's now pretty old and in r Biocpkg("clusterProfiler")
, enrichKEGG
(for KEGG pathway) and enrichMKEGG
(for KEGG module) supports downloading latest online version of KEGG data for enrichment analysis. Using r Biocannopkg("KEGG.db")
is also supported by explicitly setting use_internal_data parameter to TRUE, but it's not recommended.
With this new feature, organism is not restricted to those supported in previous release, it can be any species that have KEGG annotation data available in KEGG database. User should pass abbreviation of academic name to the organism parameter. The full list of KEGG supported organisms can be accessed via http://www.genome.jp/kegg/catalog/org_list.html.
r Biocpkg("clusterProfiler")
provides search_kegg_organism()
function to help searching supported organisms.
search_kegg_organism('ece', by='kegg_code') ecoli <- search_kegg_organism('Escherichia coli', by='scientific_name') dim(ecoli) head(ecoli)
kk <- enrichKEGG(gene = gene, organism = 'hsa', pvalueCutoff = 0.05) head(kk)
Input ID type can be kegg
, ncbi-geneid
, ncbi-proteinid
or uniprot
, an example can be found in the post.
kk2 <- gseKEGG(geneList = geneList, organism = 'hsa', nPerm = 1000, minGSSize = 120, pvalueCutoff = 0.05, verbose = FALSE) head(kk2)
KEGG Module is a collection of manually defined function units. In some situation, KEGG Modules have a more straightforward interpretation.
mkk <- enrichMKEGG(gene = gene, organism = 'hsa')
mkk2 <- gseMKEGG(geneList = geneList, species = 'hsa')
r Biocpkg("DOSE")
[@yu_dose_2015] supports Disease Ontology (DO) Semantic and Enrichment analysis. The enrichDO
function is very useful for identifying disease association of interesting genes, and function gseDO
function is designed for gene set enrichment analysis of DO.
In addition, r Biocpkg("DOSE")
also supports enrichment analysis of Network of Cancer Gene (NCG)[@omer_ncg] and Disease Gene Network[@janet_disgenet], please refer to the r Biocpkg("DOSE")
vignettes.
r Biocpkg("ReactomePA")
[@yu_reactomepa_2016] uses Reactome as a source of pathway data. The function call of enrichPathway
and gsePathway
in r Biocpkg("ReactomePA")
is consistent with enrichKEGG
and gseKEGG
.
r Biocpkg("clusterProfiler")
provides enrichment and GSEA analysis with GO, KEGG, DO and Reactome pathway supported internally, some user may prefer GO and KEGG analysis with DAVID[@huang_david_2007] and still attracted by the visualization methods provided by r Biocpkg("clusterProfiler")
[@paranjpe_genome_wid_2013]. To bridge the gap between DAVID and clusterProfiler, we implemented enrichDAVID
. This function query enrichment analysis result from DAVID webserver via r Biocpkg("RDAVIDWebService")
[@fresno_rdavidwebservice_2013] and stored the result as an enrichResult
instance, so that we can use all the visualization functions in r Biocpkg("clusterProfiler")
to visualize DAVID results. enrichDAVID
is fully compatible with compareCluster
function and comparing enrichment results from different gene clusters is now available with DAVID.
david <- enrichDAVID(gene = gene, idType = "ENTREZ_GENE_ID", listType = "Gene", annotation = "KEGG_PATHWAY", david.user = "clusterProfiler@hku.hk")
DAVID Web Service has the following limitations:
For more details, please refer to http://david.abcc.ncifcrf.gov/content.jsp?file=WS.html.
As user has limited usage, please register and use your own user account to run enrichDAVID
.
r Biocpkg("clusterProfiler")
supports both hypergeometric test and gene set enrichment analyses of many ontology/pathway, but it's still not enough for users may want to analyze their data with unsupported organisms, slim version of GO, novel functional annotation (e.g. GO via BlastGO or KEGG via KAAS), unsupported ontologies/pathways or customized annotations.
r Biocpkg("clusterProfiler")
provides enricher
function for hypergeometric test and GSEA
function for gene set enrichment analysis that are designed to accept user defined annotation. They accept two additional parameters TERM2GENE and TERM2NAME. As indicated in the parameter names, TERM2GENE is a data.frame with first column of term ID and second column of corresponding mapped gene and TERM2NAME is a data.frame with first column of term ID and second column of corresponding term name. TERM2NAME is optional.
An example of using enricher
and GSEA
to analyze DisGeNet annotation is presented in the post, use clusterProfiler as an universal enrichment analysis tool.
The MSigDB is a collection of annotated gene sets, it include 8 major collections:
Users can use enricher
and GSEA
function to analyze gene set collections downloaded from Molecular Signatures Database (MSigDb). r Biocpkg("clusterProfiler")
provides a function, read.gmt
, to parse the gmt file into a TERM2GENE data.frame
that is ready for both enricher
and GSEA
functions.
gmtfile <- system.file("extdata", "c5.cc.v5.0.entrez.gmt", package="clusterProfiler") c5 <- read.gmt(gmtfile) egmt <- enricher(gene, TERM2GENE=c5) head(egmt) egmt2 <- GSEA(geneList, TERM2GENE=c5, verbose=FALSE) head(egmt2)
Functional analysis using NGS data (eg, RNA-Seq and ChIP-Seq) can be performed by linking coding and non-coding regions to coding genes via r Biocpkg("ChIPseeker")
[@yu_chipseeker_2015] package, which can annotates genomic regions to their nearest genes, host genes, and flanking genes respectivly. In addtion, it provides a function, seq2gene
, that simultaneously considering host genes, promoter region and flanking gene from intergenic region that may under control via cis-regulation. This function maps genomic regions to genes in a many-to-many manner and facilitate functional analysis. For more details, please refer to r Biocpkg("ChIPseeker")
.
The function calls of groupGO
, enrichGO
, enrichKEGG
, enrichDO
, enrichPathway
and enricher
are consistent and all the output can be visualized by bar plot, enrichment map and category-gene-network plot. It is very common to visualize the enrichment result in bar or pie chart. We believe the pie chart is misleading and only provide bar chart.
barplot(ggo, drop=TRUE, showCategory=12)
barplot(ego, showCategory=8)
dotplot is a good alternative to barplot
.
dotplot(ego)
Enrichment map can be viusalized by enrichMap
, which also support results obtained from hypergeometric test and gene set enrichment analysis.
enrichMap(ego)
In order to consider the potentially biological complexities in which a gene may belong to multiple annotation categories and provide information of numeric changes if available, we developed cnetplot
function to extract the complex association.
## categorySize can be scaled by 'pvalue' or 'geneNum' cnetplot(ego, categorySize="pvalue", foldChange=geneList)
plotGOgraph
, which is based on r Biocpkg("topGO")
, can accept output of enrichGO
and visualized the enriched GO induced graph.
plotGOgraph(ego)
Running score of gene set enrichment analysis and its association of phenotype can be visualized by gseaplot
.
gseaplot(kk2, geneSetID = "hsa04145")
To view the KEGG pathway, user can use browseKEGG
function, which will open web browser and highlight enriched genes.
browseKEGG(kk, 'hsa04110')
r Biocpkg("clusterProfiler")
users can also use pathview
from the r Biocpkg("pathview")
[@luo_pathview] to visualize KEGG pathway.
The following example illustrate how to visualize "hsa04110" pathway, which was enriched in our previous analysis.
library("pathview") hsa04110 <- pathview(gene.data = geneList, pathway.id = "hsa04110", species = "hsa", limit = list(gene=max(abs(geneList)), cpd=1))
For further information, please refer to the vignette of r Biocpkg("pathview")
[@luo_pathview].
r Biocpkg("clusterProfiler")
was developed for biological theme comparison[@yu2012], and it provides a function, compareCluster
, to automatically calculate enriched functional categories of each gene clusters.
data(gcSample) lapply(gcSample, head)
The input for geneCluster parameter should be a named list of gene IDs. To speed up the compilation of this document, we set use_internal_data = TRUE
.
ck <- compareCluster(geneCluster = gcSample, fun = "enrichKEGG") head(as.data.frame(ck))
compareCluster
also supports passing a formula (the code to support formula has been contributed by Giovanni Dall'Olio) of type $Entrez \sim group$ or $Entrez \sim group + othergroup$.
mydf <- data.frame(Entrez=names(geneList), FC=geneList) mydf <- mydf[abs(mydf$FC) > 1,] mydf$group <- "upregulated" mydf$group[mydf$FC < 0] <- "downregulated" mydf$othergroup <- "A" mydf$othergroup[abs(mydf$FC) > 2] <- "B" formula_res <- compareCluster(Entrez~group+othergroup, data=mydf, fun="enrichKEGG") head(as.data.frame(formula_res))
We can visualize the result using dotplot
method.
dotplot(ck)
dotplot(formula_res) dotplot(formula_res, x=~group) + ggplot2::facet_grid(~othergroup)
By default, only top 5 (most significant) categories of each cluster was plotted. User can changes the parameter showCategory to specify how many categories of each cluster to be plotted, and if showCategory was set to NULL, the whole result will be plotted.
The plot function accepts a parameter by for setting the scale of dot sizes. The default parameter by is setting to "geneRatio", which corresponding to the "GeneRatio" column of the output. If it was setting to count, the comparison will be based on gene counts, while if setting to rowPercentage, the dot sizes will be normalized by count/(sum of each row)
To provide the full information, we also provide number of identified genes in each category (numbers in parentheses) when by is setting to rowPercentage and number of gene clusters in each cluster label (numbers in parentheses) when by is setting to geneRatio, as shown in Figure 3. If the dot sizes were based on count, the row numbers will not shown.
The p-values indicate that which categories are more likely to have biological meanings. The dots in the plot are color-coded based on their corresponding p-values. Color gradient ranging from red to blue correspond to in order of increasing p-values. That is, red indicate low p-values (high enrichment), and blue indicate high p-values (low enrichment). P-values and adjusted p-values were filtered out by the threshold giving by parameter pvalueCutoff, and FDR can be estimated by qvalue.
User can refer to the example in Yu (2012)[@yu2012]; we analyzed the publicly available expression dataset of breast tumour tissues from 200 patients (GSE11121, Gene Expression Omnibus)[@schmidt2008]. We identified 8 gene clusters from differentially expressed genes, and using compareCluster
to compare these gene clusters by their enriched biological process.
The comparison function was designed as a framework for comparing gene
clusters of any kind of ontology associations, not only groupGO
,
enrichGO
, enrichKEGG
and enricher
provided in this package, but
also other biological and biomedical ontologies, for instance,
enrichDO
from r Biocpkg("DOSE")
[@yu_dose_2015], enrichMeSH
from
r Biocpkg("meshes")
and enrichPathway
from r Biocpkg("ReactomePA")
work fine with compareCluster
for comparing biological themes in disease and reactome pathway perspective. More details can be found in the vignettes of r Biocpkg("DOSE")
[@yu_dose_2015] and r Biocpkg("ReactomePA")
.
Please visit clusterProfiler homepage for more information.
Here is the output of sessionInfo()
on the system on which this document was compiled:
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.