The rsemmed package provides a programmatic way for users to explore connections between the biological concepts present in the Semantic MEDLINE database [@Kilicoglu:2011].
The Semantic MEDLINE database (SemMedDB) is a collection of annotations of sentences from the abstracts of articles indexed in PubMed. These annotations take the form of subject-predicate-object triples of information. These triples are also called predications.
An example predication is "Interleukin-12 INTERACTS_WITH IFNA1". Here, the subject is "Interleukin-12", the object is "IFNA1" (interferon alpha-1), and the predicate linking the subject and object is "INTERACTS_WITH". The Semantic MEDLINE database consists of tens of millions of these predications.
Semantic MEDLINE also provides information on the broad categories into which biological concepts (predication subjects and objects) fall. This information is called the semantic type of a concept. The database assigns 4-letter codes to semantic types. For example, "gngm" represents "Gene or Genome". Every concept in the database has one or more semantic types (abbreviated as "semtypes").
Note: The information in Semantic MEDLINE is primarily computationally-derived. Thus, some information will seem nonsensical. For example, the reported semantic types of concepts might not quite match. The Semantic MEDLINE resource and this package are meant to facilitate an initial window of exploration into the literature. The hope is that this package helps guide more streamlined manual investigations of the literature.
The predications in SemMedDB can be represented in graph form. Nodes represent concepts, and directed edges represent predicates (concept linkers). In particular, the Semantic MEDLINE graph is a directed multigraph because multiple predicates are often present between pairs of nodes (e.g., "A ASSOCIATED_WITH B" and "A INTERACTS_WITH B"). The rsemmed package relies on the igraph package for efficient graph operations.
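As a rough illustration of the multigraph idea, the sketch below builds a tiny directed graph with igraph from made-up predications. The concepts A, B, and C and the object g_toy are hypothetical and are not taken from SemMedDB. Two predicates between the same pair of nodes become two parallel edges.

```r
library(igraph)

# Hypothetical predications (not from SemMedDB): the first two columns give the
# subject and object, and the remaining column becomes an edge attribute
edges <- data.frame(
    from = c("A", "A", "B"),
    to = c("B", "B", "C"),
    predicate = c("ASSOCIATED_WITH", "INTERACTS_WITH", "INHIBITS"),
    stringsAsFactors = FALSE
)
g_toy <- graph_from_data_frame(edges, directed = TRUE)

is_directed(g_toy)             # edges run from subject to object
any_multiple(g_toy)            # parallel edges exist between A and B
edge_attr(g_toy, "predicate")  # predicates are stored on the edges
```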
The full data underlying the complete Semantic MEDLINE database is available from this National Library of Medicine site as SQL dump files. In particular, the PREDICATION table is the primary file that is needed to construct the database. More information about the Semantic MEDLINE database is available here.
See the inst/script folder for scripts to perform the following processing of these raw files:
The next section describes details about the processing that occurs in these scripts to generate the graph representation.
In this vignette, we will explore a much smaller subset of the full graph that suffices to show the full functionality of rsemmed.
The graph representation of SemMedDB contains a processed and summarized form of the raw database. The toy example below illustrates the summarization performed.
| Subject | Subject semtype | Predicate | Object | Object semtype |
|---|---|---|---|---|
| A | aapp | INHIBITS | B | gngm |
| A | gngm | INHIBITS | B | aapp |
The two rows show two predications that are treated as different predications because the semantic types ("semtypes") of the subject and object vary. In the processed data, such instances have been collapsed as shown below.
| Subject | Subject semtype | Predicate | Object | Object semtype | # instances |
|---|---|---|---|---|---|
| A | aapp,gngm | INHIBITS | B | aapp,gngm | 2 |
The different semantic types for a particular concept are collapsed into a single comma-separated string that is available via igraph::vertex_attr(g, "semtype").
The "# instances" column indicates that the "A INHIBITS B" predication was observed twice in the database. This piece of information is available as an edge attribute via igraph::edge_attr(g, "num_instances"). Similarly, predicate information is also an edge attribute accessible via igraph::edge_attr(g, "predicate").
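The collapsing step can be mimicked in base R on the two toy rows above. This is only a sketch of the idea under assumed column names (subject, subject_semtype, and so on); the actual processing scripts in inst/script may work differently.

```r
# The two toy predications from the table above (column names are assumptions)
preds <- data.frame(
    subject = c("A", "A"),
    subject_semtype = c("aapp", "gngm"),
    predicate = c("INHIBITS", "INHIBITS"),
    object = c("B", "B"),
    object_semtype = c("gngm", "aapp"),
    stringsAsFactors = FALSE
)

# Count how often each subject-predicate-object triple occurs
edge_counts <- aggregate(
    list(num_instances = rep(1L, nrow(preds))),
    by = preds[c("subject", "predicate", "object")],
    FUN = sum
)

# Collapse the semantic types seen for each concept into one sorted string
concepts <- data.frame(
    name = c(preds$subject, preds$object),
    semtype = c(preds$subject_semtype, preds$object_semtype),
    stringsAsFactors = FALSE
)
node_semtypes <- tapply(
    concepts$semtype, concepts$name,
    function(x) paste(sort(unique(x)), collapse = ",")
)

edge_counts
node_semtypes
```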
A note of caution: Be careful when working with edge attributes in the Semantic MEDLINE graph manually. These operations can be very slow because there are over 18 million edges. Working with node/vertex attributes is much faster, but there are still a very large number of nodes (roughly 290,000).
The rest of this vignette will showcase how to use rsemmed functions to explore this graph.
To install rsemmed, start R and enter the following:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("rsemmed")
Load the rsemmed package and the g_small object, which contains a smaller version of the Semantic MEDLINE database.
library(rsemmed)
data(g_small)
This loads an object of class igraph named g_small into the workspace. The SemMedDB graph object is a necessary input for most of rsemmed's functions.
(The full processed graph representation linked above contains an object of class igraph named g.)
The starting point for an rsemmed exploration is to find nodes related to the initial ideas of interest. For example, we may wish to find connections between the ideas "sickle cell trait" and "malaria".
The rsemmed::find_nodes() function allows you to search for nodes by name. We supply the graph and a regular expression to use in searching through the name attribute of the nodes. Finding the most relevant nodes will generally involve iteration.
To find nodes related to the sickle cell trait, we can start by searching for nodes containing the word "sickle". (Note: searches ignore capitalization.)
nodes_sickle <- find_nodes(g_small, pattern = "sickle")
nodes_sickle
We may decide that only sickle cell anemia and the sickle trait are important. Conventional R subsetting allows us to keep the 3 related nodes:
nodes_sickle <- nodes_sickle[c(1, 3, 5)]
nodes_sickle
We can also search for nodes related to "malaria":
nodes_malaria <- find_nodes(g_small, pattern = "malaria")
nodes_malaria
The search returns more results than are printed (length(nodes_malaria) gives the total), so we can display all of them by accessing the name attribute of the returned nodes:
nodes_malaria$name
Perhaps we only want to keep the nodes that relate to disease. We could use direct subsetting, but another option is to use find_nodes() again with nodes_malaria as the input. Setting the match argument to FALSE allows us to prune unwanted matches from our results.
Below we iteratively prune matches to only keep disease-related results. Though this is not as concise as direct subsetting, it is more transparent about what was removed.
nodes_malaria <- nodes_malaria %>%
    find_nodes(pattern = "anti", match = FALSE) %>%
    find_nodes(pattern = "test", match = FALSE) %>%
    find_nodes(pattern = "screening", match = FALSE) %>%
    find_nodes(pattern = "pigment", match = FALSE) %>%
    find_nodes(pattern = "smear", match = FALSE) %>%
    find_nodes(pattern = "parasite", match = FALSE) %>%
    find_nodes(pattern = "serology", match = FALSE) %>%
    find_nodes(pattern = "vaccine", match = FALSE)
nodes_malaria
The find_nodes() function can also be used with the semtypes argument, which allows you to specify a character vector of semantic types to search for. If both pattern and semtypes are provided, they are combined with an OR operation. If you would like them to be combined with an AND operation, nest the calls in sequence.
## malaria OR disease (dsyn)
find_nodes(g_small, pattern = "malaria", semtypes = "dsyn")
## malaria AND disease (dsyn)
find_nodes(g_small, pattern = "malaria") %>%
    find_nodes(semtypes = "dsyn")
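The difference between the OR and AND combinations can be seen on a small made-up node table in base R. This sketch only mirrors the logic described above; the node names, the one-semtype-per-node simplification, and the object nodes_toy are assumptions, and it does not reproduce the internals of find_nodes().

```r
# Hypothetical nodes, each with a single semantic type for simplicity
nodes_toy <- data.frame(
    name = c("Malaria", "Malaria Vaccines", "Anemia, Sickle Cell"),
    semtype = c("dsyn", "phsu", "dsyn"),
    stringsAsFactors = FALSE
)

# Case-insensitive name match and semantic type match, computed separately
name_hit <- grepl("malaria", nodes_toy$name, ignore.case = TRUE)
type_hit <- nodes_toy$semtype %in% "dsyn"

# OR: a node is kept if either criterion matches
or_result <- nodes_toy$name[name_hit | type_hit]
# AND: a node is kept only if both criteria match (like chaining two searches)
and_result <- nodes_toy$name[name_hit & type_hit]

or_result
and_result
```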
Finally, you can also select nodes by exact name with the names argument. (Capitalization is ignored.)
find_nodes(g_small, names = "sickle trait")
find_nodes(g_small, names = "SICKLE trait")
Now that we have nodes related to the ideas of interest, we can develop further understanding by asking the following questions:
To further Aim 1, we can use the rsemmed::find_paths() function. This function takes two sets of nodes, from and to (corresponding to the two different ideas of interest), and returns all shortest paths between nodes in from ("source" nodes) and nodes in to ("target" nodes). That is, for every possible combination of a single node in from and a single node in to, all shortest undirected paths between those nodes are found.
paths <- find_paths(graph = g_small, from = nodes_sickle, to = nodes_malaria)
find_paths()
The result of find_paths() is a list with one element for each of the nodes in from. Each element is itself a list of paths between from and to. In igraph, paths are represented as vertex sequences (class igraph.vs).
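The nested list structure can be imitated directly with igraph on a toy graph. This is a sketch under assumptions, not the implementation of find_paths(): the graph g_toy and its node names are hypothetical, and the fallback between the vpaths and res elements is there because the name of the returned element differs across igraph versions.

```r
library(igraph)

# Hypothetical directed graph; s1 and s2 are sources, t1 is the target
edges <- data.frame(
    from = c("s1", "a", "s1", "t1", "s2"),
    to = c("a", "t1", "b", "b", "a"),
    stringsAsFactors = FALSE
)
g_toy <- graph_from_data_frame(edges, directed = TRUE)

from_nodes <- c("s1", "s2")
to_nodes <- "t1"

# One list element per source node, each holding all shortest paths to the
# targets; mode = "all" ignores edge direction (undirected paths)
paths_toy <- lapply(from_nodes, function(src) {
    sp <- all_shortest_paths(g_toy, from = src, to = to_nodes, mode = "all")
    if (!is.null(sp$vpaths)) sp$vpaths else sp$res
})
names(paths_toy) <- from_nodes

lengths(paths_toy)                   # number of shortest paths per source
lapply(paths_toy[["s1"]], as_ids)    # node names along each path from s1
```

In this toy graph the path s1, b, t1 only exists when direction is ignored, because both s1 and t1 point to b.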
Recall that nodes_sickle contains the nodes below:
nodes_sickle
Thus, paths is structured as follows:
- paths[[1]] is a list of paths originating from the first source node (nodes_sickle[1]$name).
- paths[[2]] is a list of paths originating from the second source node (nodes_sickle[2]$name).
- paths[[3]] is a list of paths originating from the third source node (nodes_sickle[3]$name).

With lengths() we can show the number of shortest paths starting at each of the three source ("from") nodes:
lengths(paths)
There are two ways to display the information contained in these paths: rsemmed::text_path() and rsemmed::plot_path().
- text_path() displays a text version of a path
- plot_path() displays a graphical version of the path

For example, to show the 100th of the shortest paths originating from the first of the sickle trait nodes (paths[[1]][[100]]), we can use text_path() and plot_path() as below:
this_path <- paths[[1]][[100]]
tp <- text_path(g_small, this_path)
tp
plot_path(g_small, this_path)
plot_path() plots the subgraph defined by the nodes on the path.
text_path() sequentially shows detailed information about semantic types and predicates for the pairs of nodes on the path. It also invisibly returns a list of tibbles containing the displayed information, where each list element corresponds to a pair of nodes on the path.
Finding paths between node sets necessarily uses shortest path algorithms for computational tractability. However, when these algorithms are run without modification, the shortest paths tend to be less useful than desired.
For example, one of the shortest paths from "sickle trait" to "Malaria, Cerebral" goes through the node "Infant":
this_path <- paths[[3]][[32]]
plot_path(g_small, this_path)
This likely isn't the type of path we were hoping for. Why does such a path arise? For some insight, we can use the degree() function within the igraph package to look at the degree distribution for all nodes in the Semantic MEDLINE graph. We also show the degree of the "Infant" node in red.
plot(density(degree(g_small), from = 0), xlab = "Degree", main = "Degree distribution")
## The second node in the path is "Infant" --> this_path[2]
abline(v = degree(g_small, v = this_path[2]), col = "red", lwd = 2)
We can see why "Infant" would be on a shortest path connecting "sickle trait" and "Malaria, Cerebral". "Infant" has a very large degree, and most of its connections are likely of the uninteresting form "PROCESS_OF" (a predicate indicating that the subject node is a biological process that occurs in the organism represented by the object node).
We can discourage such paths from consideration by modifying edge weights. By default, all edges have a weight of 1 in the shortest path search, but we can effectively block off certain edges by giving them a high enough weight. For example, in rsemmed::make_edge_weights(), this weight is chosen to equal the number of nodes in the entire graph. (Note that if all paths from the source node to the target node contain a given undesired edge, the process of edge reweighting will not prevent paths from containing that edge.)
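The effect of such a blocking weight can be checked on a toy graph with igraph alone. The sketch below assumes a hypothetical graph in which a node named hub offers a two-step shortcut, and it builds the weights by hand rather than with make_edge_weights().

```r
library(igraph)

# Hypothetical graph: s -> hub -> t is short, s -> x -> y -> t is one step longer
edges <- data.frame(
    from = c("s", "hub", "s", "x", "y"),
    to = c("hub", "t", "x", "y", "t"),
    stringsAsFactors = FALSE
)
g_toy <- graph_from_data_frame(edges, directed = TRUE)

# Start from weight 1 everywhere, then give every edge touching "hub" a weight
# equal to the number of nodes in the graph
w <- rep(1, ecount(g_toy))
edge_ends <- ends(g_toy, E(g_toy))
touches_hub <- edge_ends[, 1] == "hub" | edge_ends[, 2] == "hub"
w[touches_hub] <- vcount(g_toy)

p_default <- shortest_paths(g_toy, from = "s", to = "t", mode = "all")$vpath[[1]]
p_reweighted <- shortest_paths(g_toy, from = "s", to = "t", mode = "all",
                               weights = w)$vpath[[1]]

as_ids(p_default)     # passes through "hub"
as_ids(p_reweighted)  # avoids "hub" because that route now costs more
```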
The process of modifying edge weights starts by obtaining characteristics for all of the edges in the Semantic MEDLINE graph. This is achieved with the rsemmed::get_edge_features() function:
e_feat <- get_edge_features(g_small)
head(e_feat)
For every edge in the graph, the information returned in a tibble includes the semantic type (semtype) of the subject and object node.
You can directly use the information from get_edge_features() to manually construct custom weights for edges. This could include giving certain edges maximal weights as described above or encouraging certain edges by giving them lower weights.
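A manual weighting scheme might look like the following base R sketch. The table e_feat_toy and its column names (predicate, num_instances) are hypothetical stand-ins, so check the actual column names in the output of get_edge_features() before adapting it. The specific weight values are also only illustrative.

```r
# Hypothetical edge features, one row per edge (column names are assumptions)
e_feat_toy <- data.frame(
    predicate = c("PROCESS_OF", "INTERACTS_WITH", "INHIBITS"),
    num_instances = c(40, 3, 12),
    stringsAsFactors = FALSE
)
n_nodes <- 4  # assumed number of nodes in the toy graph

# Default weight of 1 for every edge
w_toy <- rep(1, nrow(e_feat_toy))
# Discourage an uninteresting predicate with a maximal weight
w_toy[e_feat_toy$predicate == "PROCESS_OF"] <- n_nodes
# Encourage a predicate of interest with a weight below the default
w_toy[e_feat_toy$predicate == "INHIBITS"] <- 0.5

w_toy
```

A vector like this has one entry per edge, in the same order as the edge feature table, which is the form a weights argument for a shortest path search would need.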
The get_edge_features() function also has arguments include_degree, include_node_ids, and include_num_instances, which can be set to TRUE to include additional edge features in the output.
- include_degree: Adds information on the degree of the subject and object nodes and the degree percentile in the entire graph. (100th percentile = highest degree)
- include_node_ids: Adds the integer IDs for the subject and object nodes. These IDs can be useful with igraph functions that compute various node/vertex metrics (e.g., centrality measures with igraph::closeness(), igraph::edge_betweenness()).
- include_num_instances: Adds information on the number of times a particular edge (predication) was seen in the Semantic MEDLINE database. This might be useful if you want to weight edges based on how commonly the relationship was reported.

make_edge_weights()
The rsemmed::make_edge_weights() function provides a way to create weights that encourage and/or discourage certain features. It allows you to specify the node names, node semantic types, and edge predicates that you would like to include in and/or exclude from paths.
- g and e_feat supply required graph metadata.
- node_semtypes_out, node_names_out, edge_preds_out are supplied as character vectors of node semantic types, names, and edge predicates that you wish to exclude from shortest path results. These three features are combined with an OR operation. An edge that meets any one of these criteria is given the highest weight possible to discourage paths from including this edge.
- node_semtypes_in, node_names_in, edge_preds_in are analogous to the "out" arguments but indicate types of edges you wish to include within shortest path results. As with the "out" arguments, these three features are combined with an OR operation. An edge that meets any one of these criteria is given a lower weight to encourage paths to include this edge.

As an example of the impact of reweighting, let's examine the connections between "sickle trait" and "Malaria, Cerebral". In order to clearly see the effects of edge reweighting, below we obtain the paths from "sickle trait" to "Malaria, Cerebral":
paths_subset <- find_paths(
    graph = g_small,
    from = find_nodes(g_small, names = "sickle trait"),
    to = find_nodes(g_small, names = "Malaria, Cerebral")
)
paths_subset <- paths_subset[[1]]
par(mfrow = c(1,2), mar = c(3,0,1,0))
for (i in seq_along(paths_subset)) {
    cat("Path", i, ": ==============================================\n")
    text_path(g_small, paths_subset[[i]])
    cat("\n")
    plot_path(g_small, paths_subset[[i]])
}
The "Child", "Woman", and "Infant" connections do not provide particularly useful biological insight. We could discourage paths from containing these nodes by specifically targeting those node names in the reweighting:
w <- make_edge_weights(g_small, e_feat,
    node_names_out = c("Child", "Woman", "Infant")
)
However, in case there are other similar nodes (like "Teens"), we might want to discourage this group of nodes by specifying the semantic type corresponding to this group. We can see the semantic types of the nodes on the shortest paths as follows:
lapply(paths_subset, function(vs) { vs$semtype })
We can see that the "humn" and "popg" semantic types correspond to the class of nodes we would like to discourage. We supply them in the node_semtypes_out argument and repeat the path search with these weights:
w <- make_edge_weights(g_small, e_feat, node_semtypes_out = c("humn", "popg"))
paths_subset_reweight <- find_paths(
    graph = g_small,
    from = find_nodes(g_small, names = "sickle trait"),
    to = find_nodes(g_small, names = "Malaria, Cerebral"),
    weights = w
)
paths_subset_reweight
The effect of that reweighting was likely not quite what we wanted. The discouraging of "humn" and "popg" nodes only served to filter down the 7 original paths to 4 shortest paths. Because the first 4 of the 7 original paths were not explicitly removed through the reweighting, they remained the shortest paths from source to target. If we would like to see different types of paths (longer paths), we should indicate that we would like to remove all of the original paths' middle nodes. We can use the rsemmed::get_middle_nodes() function to obtain a character vector of names of middle nodes in a path set.
## Obtain the middle nodes (2nd node on the path)
out_names <- get_middle_nodes(g_small, paths_subset)
## Readjust weights
w <- make_edge_weights(g_small, e_feat,
    node_names_out = out_names,
    node_semtypes_out = c("humn", "popg")
)
## Find paths with new weights
paths_subset_reweight <- find_paths(
    graph = g_small,
    from = find_nodes(g_small, pattern = "sickle trait"),
    to = find_nodes(g_small, pattern = "Malaria, Cerebral"),
    weights = w
)
paths_subset_reweight <- paths_subset_reweight[[1]]
## How many paths?
length(paths_subset_reweight)
There is clearly a much greater diversity of paths resulting from this search.
par(mfrow = c(1,2), mar = c(2,1.5,1,1.5))
plot_path(g_small, paths_subset_reweight[[1]])
plot_path(g_small, paths_subset_reweight[[2]])
plot_path(g_small, paths_subset_reweight[[1548]])
plot_path(g_small, paths_subset_reweight[[1549]])
When dealing with paths from several source and target nodes, it can be helpful to obtain the middle nodes on paths for specific source-target pairs. By default, get_middle_nodes() returns a single character vector of middle node names across all of the paths supplied. By using collapse = FALSE, the names of middle nodes can be returned for every source-target pair. When collapse = FALSE, this function enumerates all source-target pairs in tibble form. For every pair of source and target nodes in the paths object supplied, the final column (called middle_nodes) provides the names of the middle nodes as a character vector. (middle_nodes is a list-column.)
get_middle_nodes(g_small, paths, collapse = FALSE)
The make_edge_weights() function can also encourage certain features. Below we simultaneously discourage the "humn" and "popg" semantic types and encourage the "gngm" and "aapp" semantic types.
w <- make_edge_weights(g_small, e_feat,
    node_semtypes_out = c("humn", "popg"),
    node_semtypes_in = c("gngm", "aapp")
)
paths_subset_reweight <- find_paths(
    graph = g_small,
    from = find_nodes(g_small, pattern = "sickle trait"),
    to = find_nodes(g_small, pattern = "Malaria, Cerebral"),
    weights = w
)
paths_subset_reweight <- paths_subset_reweight[[1]]
length(paths_subset_reweight)
When there are many shortest paths, it can be useful to get a high-level summary of the nodes and edges on those paths. The rsemmed::summarize_semtypes() function tabulates the semantic types of nodes on paths, and the rsemmed::summarize_predicates() function tabulates the predicates of the edges.
summarize_semtypes() removes the first and last node from the paths by default because that information is generally easily accessible by using nodes_from$semtype and nodes_to$semtype. Further, if the start and end nodes are not removed, they would be duplicated in the tabulation a number of times equal to the number of paths, which likely is not desirable.
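The reason for dropping the endpoints shows up clearly in a small base R sketch. The paths below are hypothetical and are written directly as vectors of semantic types, one per node; this illustrates the tabulation idea only and is not how summarize_semtypes() is implemented.

```r
# Hypothetical paths, each given as the semantic types of its nodes in order;
# all paths share the same start (gngm) and end (dsyn) node
paths_semtypes <- list(
    c("gngm", "aapp", "dsyn"),
    c("gngm", "humn", "dsyn"),
    c("gngm", "aapp", "cell", "dsyn")
)

# Drop the first and last node of every path before tabulating
middle_only <- lapply(paths_semtypes, function(p) p[-c(1, length(p))])
tab_middle <- table(unlist(middle_only))

# Without dropping them, the endpoint types are counted once per path
tab_all <- table(unlist(paths_semtypes))

tab_middle
tab_all
```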
summarize_semtypes() invisibly returns a tibble where each row corresponds to a pair of source (from) and target (to) nodes in the paths object supplied, and the final semtypes column is a list-column containing a table of semantic type information. It automatically prints the semantic type tabulations for each from-to pair, but if you would like to turn off printing, use print = FALSE.
## Reweighted paths from "sickle trait" to "Malaria, Cerebral"
semtype_summary <- summarize_semtypes(g_small, paths_subset_reweight)
semtype_summary
semtype_summary$semtypes[[1]]
## Original paths from "sickle" to "malaria"-related nodes
summarize_semtypes(g_small, paths)
The summarize_predicates() function works similarly to give information on predicate counts.
edge_summary <- summarize_predicates(g_small, paths)
edge_summary
edge_summary$predicates[[1]]
Another way in which we can explore relations between ideas is to slowly expand a single set of ideas to see what other ideas are connected. We can do this with the grow_nodes() function. The grow_nodes() function takes a set of nodes and obtains the nodes that are directly connected to any of these nodes. That is, it obtains the set of nodes that are distance 1 away from the supplied nodes. We can call this set of nodes the "1-neighborhood" of the supplied nodes.
nodes_sickle_trait <- nodes_sickle[2:3]
nodes_sickle_trait
nbrs_sickle_trait <- grow_nodes(g_small, nodes_sickle_trait)
nbrs_sickle_trait
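The notion of a 1-neighborhood can be reproduced on a toy graph with igraph. The sketch below uses a hypothetical graph and seed set and ignores edge direction; whether grow_nodes() treats direction or the seed nodes themselves in exactly this way is an assumption.

```r
library(igraph)

# Hypothetical directed graph with two seed nodes, "seed" and "c"
edges <- data.frame(
    from = c("seed", "a", "b", "c"),
    to = c("a", "b", "seed", "d"),
    stringsAsFactors = FALSE
)
g_toy <- graph_from_data_frame(edges, directed = TRUE)
seeds <- c("seed", "c")

# Nodes at distance 1 from each seed, ignoring edge direction
nbr_list <- adjacent_vertices(g_toy, seeds, mode = "all")

# Pool the neighbors of all seeds and drop duplicates
nbrs_toy <- sort(unique(unlist(lapply(nbr_list, as_ids))))
nbrs_toy
```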
Not all nodes in the 1-neighborhood will be useful, and we may wish to remove them with find_nodes(..., match = FALSE). We can use summarize_semtypes() to begin to identify such nodes. Using the argument is_path = FALSE will change the format of the display and output to better suit this situation.
nbrs_sickle_trait_summ <- summarize_semtypes(g_small, nbrs_sickle_trait, is_path = FALSE)
The printed summary displays nodes grouped by semantic type. The semantic types are ordered such that the semantic type with the highest degree node is shown first. Often, these high degree nodes are less interesting because they represent fairly broad concepts.
- The node_degree column shows the degree of the node in the Semantic MEDLINE graph.
- The node_degree_perc column gives the percentile of the node degree relative to all nodes in the Semantic MEDLINE graph.

The resulting tibble (nbrs_sickle_trait_summ) contains the same information that is printed and provides another way to mine for nodes to remove.
After inspection of the summary, we can remove nodes based on semantic type and/or name. We can achieve this with find_nodes(..., match = FALSE). The ... can be any combination of the pattern, names, or semtypes arguments. If a node matches any of these pieces, it will be excluded with match = FALSE.
length(nbrs_sickle_trait)
nbrs_sickle_trait2 <- nbrs_sickle_trait %>%
    find_nodes(
        pattern = "^Mice",
        semtypes = c("humn", "popg", "plnt", "fish", "food", "edac", "dora", "aggp"),
        names = c("Polymerase Chain Reaction", "Mus"),
        match = FALSE
    )
length(nbrs_sickle_trait2)
It is natural to consider a chain like the one below as a strategy to iteratively explore outward from a seed idea.
seed_nodes %>%
    grow_nodes() %>%
    find_nodes() %>%
    grow_nodes() %>%
    find_nodes()
Be careful when implementing this strategy because the grow_nodes() step can very quickly return far more nodes than is manageable. Often after just two sequential uses of grow_nodes(), the number of nodes returned can be too large to efficiently sift through unless you conduct substantial filtering with find_nodes() between uses of grow_nodes().
In summary, the rsemmed package provides tools for finding and connecting biomedical concepts.
- Find nodes with the find_nodes() function.
- Find paths between sets of nodes with find_paths().
- The make_edge_weights() function will allow you to tailor path-finding by creating custom weights. It requires metadata provided by get_edge_features().
- The get_middle_nodes(), summarize_semtypes(), and summarize_predicates() functions all help explore paths/node collections to inform reweighting.
- Expand a set of nodes to its neighbors with grow_nodes().

Your workflow will likely involve iteration between all of these different components.
sessionInfo()