knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.align = 'center', crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html )
Network motifs are the building blocks of complex networks, and they can be
interpreted as small genetic circuits. Identifying and counting motifs in
gene regulatory networks can reveal important aspects of the evolution of
transcriptional regulation. In particular, they can be used to explore
the impact of gene duplication in the rewiring of
regulatory interactions [@mottes2021impact]. magrene
aims to identify and
analyze motifs in (duplicated) gene regulatory networks to better comprehend
the role of gene duplication in network evolution. The figure below shows
the networks motifs users can identify with magrene
.
knitr::include_graphics("motifs_vignette.png")
You can install magrene
with:
if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("magrene")
Then, you can load the package with:
library(magrene) set.seed(123) # for reproducibility
For this vignette, we will use three example data sets:
gma_grn: A gene regulatory network inferred with
BioNERO
[@almeida2022bionero] using soybean RNA-seq data from the Soybean
Expression Atlas [@machado2020systematic].
gma_ppi: A protein-protein interaction (PPI) network for soybean obtained from the STRING database [@szklarczyk2021string], filtered to contain only physical interactions with confidence score > 0.4.
gma_paralogs: Soybean paralogous gene pairs derived by whole-genome, tandem, proximal, transposed, and dispersed duplications (WGD, TD, PD, TRD, and DD, respectively). This data set was obtained from @almeida2020exploring.
Networks are represented as edge lists. Let's take a look at the data.
data(gma_grn) head(gma_grn) data(gma_ppi) head(gma_ppi) data(gma_paralogs) head(gma_paralogs)
Motifs can be found using find_*
motifs, as shown in Figure 1. Each function
returns a character vector of motifs, and each motif has its own character
representation. Let's demonstrate how they work.
For the sake of demonstration, we will only use WGD-derived paralogs.
For GRN motifs, we will only use a smaller subset of the edge list.
# Include only WGD-derived paralogs paralogs <- gma_paralogs[gma_paralogs$type == "WGD", 1:2] # Keep only the top 30k edges of the GRN, remove "Weight" variable grn <- gma_grn[1:30000, 1:2]
PPI V motifs are paralogous proteins that share an interaction partner.
To find them, you will use find_ppi_v()
. The character representation of
PPI V motifs is:
$$ \text{paralog1-partner-paralog2} $$
# Find PPI V motifs ppi_v <- find_ppi_v(gma_ppi, paralogs) head(ppi_v)
V motifs are paralogous regulators that regulate the same target.
These motifs can be created when a regulator undergoes a
small-scale duplication. To find them, you will use find_v()
.
The character representation of V motifs is:
$$ \text{regulator->target<-regulator} $$
# Find V motifs v <- find_v(grn, paralogs) head(v)
Lambda motifs are the opposite of V motifs: a single regulator that regulates
two target genes that are paralogous. These motifs can be created when
an ancestral target gene undergoes a small-scale duplication. To find them
you will use find_lambda()
. The character representation of lambda motifs is:
$$ \text{target1<-regulator->target2} $$
lambda <- find_lambda(grn, paralogs) head(lambda)
Delta motifs are pretty similar to lambda motifs, but here we take
protein-protein interactions between targets into account. Thus, they are
represented by a regulator that regulates two targets that interact at the
protein level. They can be created by the same evolutionary mechanism of
lambda motifs. To find them, you will use find_delta()
. The character
representation of delta motifs is:
$$ \text{target1<-regulator->target2} $$
To find delta motifs, you have two options:
Pass PPI edge list + a vector of previously identified lambda motifs (recommended).
Pass PPI edge list + GRN edge list + paralogs data frame. In this option,
find_delta()
will find lambda motifs first, then use the lambda vector to
find delta motifs. If you have identified lambda motifs beforehand, it is
way faster to pass the lambda vector to find_delta()
, so you don't have
to do double work.
# Find delta motifs from lambda motifs delta <- find_delta(edgelist_ppi = gma_ppi, lambda_vec = lambda) head(delta)
Bifan motifs are the most complex: they are represented by
two paralogous regulators that regulate the same set of two paralogous targets.
They can be created when both the ancestral regulator and the ancestral target
are duplicated by small-scale duplications, or when the genome undergoes a
whole-genome duplication event. To find these motifs,
you will use find_bifan()
. The character representation of bifan motifs is:
$$ \text{regulator1,regulator2->target1,target2} $$
Under the hood, what find_bifan()
does it to find lambda motifs involving
the same targets and check if their regulators are paralogs. Thus, if you have
identified lambda motifs beforehand, it is much faster to simply give them
to find_bifan()
, so it doesn't have to find them again.
# Find bifans from lambda motifs bifan <- find_bifan(paralogs = paralogs, lambda_vec = lambda) head(bifan)
As motifs are simple character vectors, one can count their frequencies with
the base R length()
function. For example, let's count the frequency of
each motif in our example data set:
count_df <- data.frame( Motif = c("PPI V", "V", "Lambda", "Delta", "Bifan"), Count = c( length(ppi_v), length(v), length(lambda), length(delta), length(bifan) ) ) count_df
However, unless you have another data set to which you can compare your
frequencies, counting is not enough. You need to evaluate the significance
of your motif frequencies. One way to do that is by comparing your observed
frequencies to a null distribution generated by counting motifs in
N (e.g., 1000) simulated networks [^1]. magrene
allows you to generate
null distributions of motif frequencies for each motif type with the function
generate_nulls()
. As generating the null distributions takes a bit of time,
we will demonstrate generate_nulls()
with 5 permutations only.
As a rule of thumb, you would probably want N >= 1000.
generate_nulls(grn, paralogs, gma_ppi, n = 5)
[^1]: NOTE: Simulated networks are created by node label permutation (i.e., resampling node labels without replacement). This method allows you to have random networks that preserve the same degree of the original network. Hence, networks are called degree-preserving simulated networks.
As you can see, the output of generate_nulls()
is a list of numeric vectors
with the frequency of each motif type in the simulated networks[^2].
You can use the null distribution to calculate Z-scores for your observed
frequencies, which are defined as below:
[^2]: Note on performance: The function generate_nulls()
can be
parallelized thanks to the Bioconductor
package r BiocStyle::Biocpkg("BiocParallel")
. However, keep in mind that
parallelization is not always the best choice, because it may
take longer to create multiple copies of your data to split into multiple
cores than it takes to find motifs with a single core.
$$ Z = \frac{ n_{observed} - \bar{n}{null} }{ \sigma{null} } $$
To calculate Z-scores, you can use the function calculate_Z()
. As input,
you need to give a list of observed frequencies and a list of nulls.
Here, we will load pre-computed null distributions of N = 100.
# Load null distros data(nulls) head(nulls) # Create list of observed frequencies observed <- list( lambda = length(lambda), bifan = length(bifan), V = length(v), PPI_V = length(ppi_v), delta = length(delta) ) calculate_Z(observed, nulls)
Now that you have Z-scores, you can use a cut-off of your choice to define significance.
Finally, another interesting pattern you may want to analyze is the interaction similarity between paralogous gene pairs. Previous studies have demonstrated that the Sorensen-Dice similarity is a suitable index for interaction similarity [@defoort2019evolution; @mottes2021impact], which is defined as:
$$ S(A,B) = \frac{2 \left| A \cap B \right|}{ \left|A \right| + \left| B \right|} $$
where A and B are the interacting partners of nodes A and B.
To calculate the Sorensen-Dice similarity for paralogous gene pairs,
you can use the function sd_similarity()
. Let's demonstrate it by
calculating the similarity between paralogs in the PPI network.
sim <- sd_similarity(gma_ppi, paralogs) head(sim) summary(sim$sorensen_dice)
This document was created under the following conditions:
sessioninfo::session_info()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.