This package aims to visualize the word and text information associated with genes or other omics identifiers, such as those from microbiome studies, to identify important words among clusters, and to integrate and compare clusters based on that information. It contributes to understanding the functional implications of omics identifier lists and aids in their interpretation and visualization. This vignette introduces the basic usage for generating networks and combining them, as well as customized usage. Detailed options and usage, such as integration with other packages, are described in the package's bookdown. A web server is also available for convenient querying.
Install and load the package and the database for converting identifiers. This example mostly uses human-derived data, so we load org.Hs.eg.db.
# devtools::install_github("noriakis/biotextgraph")
library(biotextgraph)
library(org.Hs.eg.db)
library(ggplot2)
library(ggraph)
The main function accepts omics identifiers and generates a network. Various functions are available for this purpose; to create a network from a gene list, refseq is used. Specifying plotType as network generates a network. The function returns a biotext class object whose slots contain various information. The default ID type is SYMBOL, but another type can be specified in keyType. As many words are commonly observed, you should limit word frequency with excludeFreq, which defaults to 2000. TF-IDF over all the summaries is precomputed, so exclude="tfidf" can be specified as well. The net slot stores the visualization result generated by ggraph.
## Configure input genes (ERCC genes)
inpSymbol <- c("ERCC1","ERCC2","ERCC3","ERCC4","ERCC5","ERCC6","ERCC8")
net <- refseq(inpSymbol, plotType="network")
net
plotNet(net, asis=TRUE)
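To make the idea behind frequency-based and TF-IDF-based word exclusion concrete, the following is a minimal base R sketch on invented toy descriptions. It does not use the package's precomputed tables; the threshold max_freq is a hypothetical stand-in for the role that excludeFreq plays, and the TF-IDF formula shown is one common variant rather than necessarily the one the package uses.

```r
## Toy gene descriptions (invented for illustration, not RefSeq text)
texts <- c(
  geneA = "dna repair protein involved in excision repair",
  geneB = "protein involved in chemokine signaling",
  geneC = "dna helicase protein for excision repair"
)

## Tokenize each description on whitespace, keeping the gene names
tokens <- setNames(strsplit(texts, "\\s+"), names(texts))

## Frequency of each word across the whole toy corpus
freq <- table(unlist(tokens))

## Hypothetical threshold: words seen more often than this are dropped
max_freq <- 2
common_words <- names(freq)[freq > max_freq]

## Remove the overly common words from every description
filtered <- lapply(tokens, function(w) w[!w %in% common_words])

## Term frequency matrix: rows are words, columns are genes
vocab <- sort(unique(unlist(tokens)))
tf <- vapply(
  tokens,
  function(w) as.numeric(table(factor(w, levels = vocab))),
  numeric(length(vocab))
)
rownames(tf) <- vocab

## TF-IDF: term frequency times log inverse document frequency
idf <- log(length(tokens) / rowSums(tf > 0))
tfidf <- tf * idf
```

In this toy corpus, a word that appears in every description receives a TF-IDF of zero, which is why such scores can serve as an alternative exclusion criterion to raw frequency.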
The network visualization can be customized with options such as colorText, which colors the text according to the node colors, and edgeLink, which selects the geom used to represent edges.
cxcls <- NULL
for (i in c(1,2,3,5,6,8,9,10,11,12,13,14,16)){
    cxcls <- c(cxcls, paste0("CXCL",i))
}
cxclNet <- refseq(cxcls, plotType="network",
    colorText=TRUE, edgeLink=FALSE, autoThresh=FALSE)
plotNet(cxclNet, asis=TRUE)
IDs from the input list that are related to important words in the text can be drawn in the network. This makes it possible to extract important IDs within the list based on word frequency. For genes, those associated with frequently occurring words within the cluster can be shown with genePlot=TRUE, and their number can be controlled with genePlotNum.
net <- refseq(cxcls, plotType="network", autoThresh=FALSE,
    colorText=TRUE, edgeLink=FALSE,
    genePlot=TRUE, genePlotNum=5)
net
plotNet(net, asis=TRUE)
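The relationship between frequent words and the genes that carry them can be sketched in base R as follows. The descriptions are invented and the cutoff of two documents is an arbitrary assumption; the package's own selection logic behind genePlot and genePlotNum may differ.

```r
## Toy descriptions keyed by gene symbol (invented text)
texts <- c(
  CXCL1 = "chemokine with neutrophil chemoattractant activity",
  CXCL2 = "chemokine secreted from monocytes",
  CXCL9 = "cytokine induced upon interferon stimulation"
)

## Unique words per gene, so each gene counts a word at most once
tokens <- lapply(strsplit(texts, "\\s+"), unique)
names(tokens) <- names(texts)

## Number of genes whose description contains each word
doc_freq <- table(unlist(tokens))

## Words shared by at least two genes (arbitrary cutoff for this sketch)
top_words <- names(doc_freq)[doc_freq >= 2]

## For every frequent word, collect the genes that mention it
word_to_genes <- lapply(
  setNames(top_words, top_words),
  function(w) {
    names(tokens)[vapply(tokens, function(tk) w %in% tk, logical(1))]
  }
)
word_to_genes
```

Each element of word_to_genes corresponds to a word node, and its contents are the gene nodes that would be linked to it in such a plot.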
Sets of words can be tagged by enabling the option tag="cor" (based on the adjacency matrix of the inferred network) or tag="tdm" (based on the term-document matrix). This shows which word sets appear together significantly and allows this information to be reflected in the plots.
tag <- refseq(c(inpSymbol, cxcls), plotType="network",
    tag="cor", colorText=TRUE, edgeLink=FALSE,
    genePlot=TRUE, genePlotNum=5)
getSlot(tag, "pvpick")
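As a rough illustration of what grouping words into tags means, the sketch below clusters words by the correlation of their counts across genes using plain hierarchical clustering from base R. This is a simplified stand-in and not the package's own tagging procedure; the matrix values are invented, and no significance assessment is performed here.

```r
## Toy term-document matrix: rows are words, columns are genes (invented)
tdm <- matrix(
  c(2, 1, 0, 0,
    1, 2, 0, 0,
    0, 0, 1, 2,
    0, 0, 2, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("repair", "excision", "chemokine", "receptor"),
    paste0("gene", 1:4)
  )
)

## Correlation between words, computed across genes
word_cor <- cor(t(tdm))

## Convert correlation to a distance and cluster the words
hc <- hclust(as.dist(1 - word_cor), method = "average")

## Cut the tree into two groups; each group acts as a tag
tags <- cutree(hc, k = 2)
split(names(tags), tags)
```

Words with the same tag value could then share a color in a network plot, which is the kind of information the tagging options make available.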
The important genes obtained can be used as queries to search databases such as PubMed, and the results can be visualized. A PubMed API key is recommended for this purpose and can be specified in apiKey.
getSlot(net, "geneCount") |> head()
pmquery <- getSlot(net, "geneCount") |> head() |> names()
## Not run in vignette
# pubmed(pmquery, plotType="network")
Each process can be broken down into piped operations, and intermediate results can be stored for later analysis.
btg <- obtain_refseq(inpSymbol) |> ## Obtain RefSeq description
    set_filter_words() |> ## Set filtering words
    make_corpus() |> ## Make corpus
    make_TDM() |> ## Make term-document matrix
    make_graph() |> ## Make graph
    process_network_gene(gene_plot=TRUE) |> ## Process graph for showing associated genes to words
    plot_biotextgraph(edge_link=FALSE) |> ## Make plot for the network (stored in `net` slot)
    plot_wordcloud() ## Make wordcloud plot (stored in `wc` slot)
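The stages of such a pipeline (text, filtered tokens, term-document matrix, graph) can be mimicked on toy data with a few small base R helpers. The helper names to_tokens, drop_words, to_tdm, and to_edges are hypothetical and are not part of the package; the sketch only shows how each stage feeds the next, with edges weighted by the number of descriptions in which two words co-occur.

```r
## Invented descriptions standing in for retrieved text
texts <- c(
  g1 = "DNA repair and excision repair",
  g2 = "Excision repair of damaged DNA",
  g3 = "Chemokine receptor signaling"
)
stop_words <- c("and", "of")  # hypothetical filter list

## Lowercase and split into words, keeping document names
to_tokens <- function(x) setNames(strsplit(tolower(x), "\\s+"), names(x))

## Remove filtered words from every document
drop_words <- function(tokens, words) {
  lapply(tokens, function(w) w[!w %in% words])
}

## Build a term-document matrix (rows: words, columns: documents)
to_tdm <- function(tokens) {
  vocab <- sort(unique(unlist(tokens)))
  m <- vapply(
    tokens,
    function(w) as.numeric(table(factor(w, levels = vocab))),
    numeric(length(vocab))
  )
  dimnames(m) <- list(vocab, names(tokens))
  m
}

## Edge list of word pairs, weighted by shared document count
to_edges <- function(tdm) {
  present <- (tdm > 0) * 1
  co <- present %*% t(present)
  idx <- which(upper.tri(co) & co > 0, arr.ind = TRUE)
  data.frame(
    from = rownames(co)[idx[, 1]],
    to = colnames(co)[idx[, 2]],
    weight = co[idx]
  )
}

edges <- texts |>
  to_tokens() |>
  drop_words(stop_words) |>
  to_tdm() |>
  to_edges()
edges
```

Storing the output of any intermediate step in a variable, rather than piping it onward, gives the same kind of checkpoint that the stepwise functions above provide.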
As an example of comparing text information on a network, this section demonstrates the comparison of gene lists within KEGG pathways. For applications using actual public data, please refer to the documentation.
keggPathways <- org.Hs.egPATH2EG
mappedKeys <- mappedkeys(keggPathways)
keggList <- as.list(keggPathways[mappedKeys])

## Hepatitis C
hCNet <- refseq(keggList$`05160`,
    plotType="network", layout="nicely",
    keyType="ENTREZID", autoThresh=FALSE,
    excludeFreq = 5000, colorText=TRUE,
    edgeLink=FALSE, showLegend=FALSE)
plotNet(hCNet, asis=TRUE)
We create another biotext object to compare with.
ecoli <- refseq(keggList$`05130`, keyType="ENTREZID", autoThresh=FALSE)
Networks can be compared with compareWordNet. Given multiple biotext class objects, it creates a new network by integrating the networks and the tag information contained in each object. This makes it possible to compare multiple different ID lists.
compareWordNet(list(hCNet, ecoli),
    titles=c("RefSeq_05160","RefSeq_05130"),
    colPal = "Dark2") |>
    plotNet()
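The idea of integrating two word networks can be sketched as merging their edge lists and labeling each edge by the list it came from. The edge lists and labels below are invented, and the sketch assumes that each edge is stored in the same orientation in both lists; it is not a description of how compareWordNet is implemented.

```r
## Two toy edge lists, one per gene list (invented)
edges_a <- data.frame(from = c("dna", "dna"), to = c("repair", "excision"))
edges_b <- data.frame(from = c("dna", "virus"), to = c("repair", "infection"))

## A string key that identifies an edge
edge_key <- function(e) paste(e$from, e$to, sep = "--")

## Union of both edge lists without duplicates
all_edges <- unique(rbind(edges_a, edges_b))

## Mark where each edge was observed
in_a <- edge_key(all_edges) %in% edge_key(edges_a)
in_b <- edge_key(all_edges) %in% edge_key(edges_b)
all_edges$source <- ifelse(in_a & in_b, "both",
                    ifelse(in_a, "listA", "listB"))
all_edges
```

Coloring edges or nodes by the source column then distinguishes shared structure from structure that is specific to one list.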
Text in enrichment analysis results can optionally be summarized with the enrich option. The example below uses enrichment analysis against the KEGG database.
if (requireNamespace("clusterProfiler")) {
    hCNetK <- refseq(keggList$`05160`,
        enrich="kegg", keyType="ENTREZID",
        cooccurrence = TRUE, topPath=50, numWords=50,
        autoThresh=FALSE, plotType="network", corThresh=0.1)
    plotNet(hCNetK, asis=TRUE)
}
Besides genes, microbial information can be summarized in a similar manner. For obtaining and summarizing information on disease relationships, enzymes, metabolites, and biological pathways, please refer to the documentation. Furthermore, the manual function performs similar operations on customized user input.
The package provides other visualization options, such as producing a word cloud of biomedical textual information by querying gene IDs or other identifiers.
gwc <- refseq(inpSymbol, plotType="wc")
gwc
plotWC(gwc, asis=TRUE)
Options of wordcloud or ggwordcloud, including color and rotation, can be specified in argList.
gwc <- refseq(inpSymbol, numWords=200,
    argList=list(max.words=200,
        random.order=FALSE,
        colors=RColorBrewer::brewer.pal(5, "Dark2"),
        rot.per=0.4),
    plotType="wc", scaleFreq=2)
gwc
plotWC(gwc, asis=TRUE)
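One common way for a wrapper to forward a list of options such as argList to an underlying plotting function is do.call. The sketch below shows that pattern with two hypothetical functions, draw_cloud and make_cloud, which only mimic argument handling and draw nothing; whether the package forwards argList in exactly this way is an assumption.

```r
## Hypothetical stand-in for a word cloud function: it only records
## which words and options it would use
draw_cloud <- function(words, freq, max.words = 100,
                       rot.per = 0.1, random.order = TRUE) {
  n_keep <- min(max.words, length(words))
  keep <- order(freq, decreasing = TRUE)[seq_len(n_keep)]
  list(words = words[keep], rot.per = rot.per, random.order = random.order)
}

## Hypothetical wrapper: fixed arguments plus user options in arg_list
make_cloud <- function(words, freq, arg_list = list()) {
  do.call(draw_cloud, c(list(words = words, freq = freq), arg_list))
}

res <- make_cloud(
  words = c("repair", "dna", "excision"),
  freq = c(5, 3, 1),
  arg_list = list(max.words = 2, rot.per = 0.4)
)
res
```

Options not named in arg_list keep the defaults of the underlying function, so only the settings that need to change have to be listed.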
Text summaries such as word clouds can be combined with other plots. For example, they can be displayed on reduced dimension plots in single-cell analysis. For other ways of combining them, please refer to the documentation.
Customized functions are available for annotating relationships between gene clusters. If you have clustered gene expression data or other identifiers and examine the relationship between clusters with a dendrogram, the plotEigengeneNetworksWithWords function can be used to populate the resulting dendrogram with text summaries.
mod <- returnExample()
plotEigengeneNetworksWithWords(mod$MEs, mod$colors) +
    scale_y_continuous(expand=c(0,1)) ## Scaled so that labels are not truncated
Other examples, such as interactive visualization of cluster networks using actual data and populating reduced dimension plots in single-cell transcriptomics, are described in the documentation.
sessionInfo()