BiocStyle::markdown()
library("knitr")
opts_chunk$set(stop_on_error = 1L)
suppressPackageStartupMessages(library("MetNet"))
Among the main challenges in mass spectrometric metabolomic analysis is the
high-throughput analysis of metabolic features, their fast detection and
annotation.
In contrast to the screening of known, previously characterized
metabolic features in these data, the putative annotation of unknown
features is often cumbersome and requires a lot of manual work, which hinders
the retrieval of biological information from these data.
High-resolution mass spectrometric data are often very rich in information
content, and metabolic conversions and reactions can be derived from structural
properties of features [@Breitling2006].
In addition, statistical associations between
features (based on their intensity values) can be a valuable resource to find
co-synthesised or co-regulated metabolites, which are synthesised in the same
biosynthetic pathways. Since the R framework still lacks an analysis tool that
integrates the two kinds of information commonly
acquired with mass spectrometers (m/z and intensity values), I developed
MetNet to close this gap.
The MetNet
package comprises functionalities to infer network
topologies from high-resolution mass spectrometry data. MetNet
combines information from both structural data (differences in m/z values
of features) and statistical associations (intensity values of features per
sample) to propose putative metabolic networks that can be used for further
exploration.
The idea of using high-resolution mass spectrometry data for network construction was first proposed in @Breitling2006 and followed soon afterwards by a Cytoscape plugin, MetaNetter [@Jourdan2007], that is based on the inference of metabolic networks on molecular weight differences and correlation (Pearson correlation and partial correlation).
Inspired by the paper of @Marbach2012, different algorithms for network
inference were implemented in MetNet to account for
biases that are inherent in these statistical methods, followed by the
calculation of a consensus adjacency matrix from the individually computed
adjacency matrices.
The two main functionalities of the package are the creation of an adjacency matrix from structural properties, based on losses/additions of functional groups defined by the user, and the creation of an adjacency matrix from statistical associations. Currently, the following statistical models are implemented to infer a statistical adjacency matrix: Least absolute shrinkage and selection operator (LASSO, L1-norm regression, [@Tibshirani1994]), Random Forest [@Breiman2001], Pearson and Spearman correlation (including partial and semipartial correlation, see @Steuer2006 for a discussion on correlation-based metabolic networks), context likelihood of relatedness (CLR, [@Faith2007]), the algorithm for the reconstruction of accurate cellular networks (ARACNE, [@Margolin2006]) and score-based structure learning (Bayes, [@Scutari2010]). Since all of these methods have advantages and disadvantages, the user can select several of these methods, compute adjacency matrices from these models and create a consensus matrix from the different statistical frameworks.
After creating the statistical and structural adjacency matrices, these two matrices can be combined to form a consensus matrix that contains information from both the structural and the statistical properties of the data. This can be followed by further network analyses (e.g. calculation of topological parameters), integration with other data sources (e.g. genomic information or transcriptomic data) and/or visualization.
MetNet is currently under active development. If you
discover any bugs or typos, or have ideas for improving
MetNet, feel free to raise an issue via
GitHub or send a mail to the developer.
To install MetNet, enter the following in the R console:
install.packages("BiocManager")
BiocManager::install("MetNet")
Before starting with the analysis, load the MetNet package. This
will also load the required packages glmnet, stabs, randomForest, rfPermute,
mpmi, parmigene, WGCNA and bnlearn that are needed
for the functions of the statistical adjacency matrix inference.
library(MetNet)
The data format that is compatible with the MetNet framework is
an xcms/CAMERA output-like $m~\times~n$ matrix, where
the columns denote the $n$ different samples and the rows denote the $m$ features.
Such a matrix needs to contain information about the masses of the features and
quantitative information on the features (intensity or concentration values).
The information about the m/z values has to be stored in the column "mz",
i.e. as a vector of length $m$.
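As an illustration of this layout, the following minimal sketch builds a small toy matrix in base R. The object name x_toy, the sample names and all values are made up for this example and are not part of MetNet.

```r
## toy feature table: 4 features (rows), columns "mz" and "rt" plus 3 samples
## (all names and values are hypothetical)
set.seed(1)
x_toy <- cbind(
    mz = c(180.0634, 224.0532, 342.1162, 504.1690),
    rt = c(310, 295, 250, 220),
    matrix(rlnorm(12, meanlog = 10), nrow = 4,
        dimnames = list(NULL, c("sample_1", "sample_2", "sample_3"))))
rownames(x_toy) <- paste0("feature_", seq_len(nrow(x_toy)))

## the "mz" column holds one m/z value per feature
length(x_toy[, "mz"]) == nrow(x_toy)
```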
MetNet does not impose any requirements for
data normalization, filtering, etc. However, the user has to make sure that
the data is properly preprocessed. This includes, for example, division by an
internal standard, log2 transformation, noise filtering, removal of features
that do not represent mass features/metabolites and removal of isotopes.
Here, we will load the object x_test, which contains m/z values
(in the column "mz") together with the corresponding retention times
(in the column "rt") and intensity values. We will use
x_test for guidance through the workflow of MetNet.
data("x_test", package = "MetNet")
x_test <- as.matrix(x_test)
The function structural will create the adjacency
matrix based on structural properties (m/z values) of the features.
The function expects a matrix with a column "mz" that contains the
mass information of a feature (typically the m/z value). Furthermore,
structural takes a data.frame object as the argument transformation
with the columns "mass" and "name" and optional additional columns
(e.g. "formula").
structural looks for transformations (in the
sense of additions/losses of functional groups mediated by biochemical,
enzymatic reactions) in the data using the mass information.
Following the work of [@Breitling2006] and [@Jourdan2007], the molecular weight difference w~X~ is defined by $w_X = \vert w_A - w_B \vert$
where w~A~ is the molecular weight of substrate A, and w~B~ is the molecular weight of product B (typically, m/z values will be used as a proxy for the molecular weight since the molecular weight is not directly derivable from mass spectrometric data). As exemplified in [@Jourdan2007], specific enzymatic reactions correspond to specific changes in the molecular weight, e.g. carboxylation reactions will result in a mass difference of 43.98983 (molecular weight of CO~2~) between metabolic features.
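The definition of w~X~ can be illustrated with a few lines of base R. The m/z values below are hypothetical and chosen so that one pair differs by approximately the mass of CO~2~.

```r
## hypothetical m/z values of three features
mz <- c(A = 138.0317, B = 182.0215, C = 300.0845)

## pairwise absolute differences w_X = |w_A - w_B|
w_diff <- abs(outer(mz, mz, FUN = "-"))

## the difference between A and B is close to the mass of CO2
co2 <- 43.98983
abs(w_diff["A", "B"] - co2) < 0.001
```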
The search space for these transformations is adjustable via the
transformation argument of structural, which allows looking for specific
enzymatic transformations of interest.
structural will take into account the
ppm value to adjust for inaccuracies in the m/z values due to technical
reasons according to the formula
$$ppm = \frac{m_{exp} - m_{calc}}{m_{exp}} \cdot 10^{6}$$
with m~exp~ the experimentally determined m/z value and m~calc~ the
calculated accurate mass of a molecule. Within the function, a lower and upper
range is calculated depending on the supplied ppm value, differences
between the m/z values of the features are calculated and matched against the
values in the "mass" column of the transformation argument. If any
of the additions/losses defined in transformation is found in the
data, it will be reported as an (unweighted) connection in the
returned adjacency matrix. Together with the adjacency matrix, the type of
connection (derived from the column "name" of transformation) will be
written to a character matrix. The function structural returns these
two matrices as a list (first entry: numerical adjacency
matrix, second entry: character matrix).
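The matching step described above can be sketched in base R as follows. This is a simplified illustration and not the implementation used in structural: the helper match_transformations is hypothetical, and the way the ppm window is turned into a range of admissible differences is an assumption made for this example.

```r
## hypothetical helper, not part of MetNet: match m/z differences against
## transformation masses within a ppm tolerance
match_transformations <- function(mz, transformations, ppm = 10) {
    n <- length(mz)
    adj <- matrix(0, n, n, dimnames = list(names(mz), names(mz)))
    type <- matrix("", n, n, dimnames = dimnames(adj))
    ## assumed lower and upper bounds of each m/z value for the given ppm
    mz_lower <- mz * (1 - ppm / 1e6)
    mz_upper <- mz * (1 + ppm / 1e6)
    for (i in seq_len(n)) {
        for (j in seq_len(n)) {
            ## only look at pairs where feature i has the higher m/z
            if (mz[i] <= mz[j]) next
            ## smallest and largest difference compatible with the bounds
            d_min <- mz_lower[i] - mz_upper[j]
            d_max <- mz_upper[i] - mz_lower[j]
            hit <- which(transformations$mass >= d_min &
                transformations$mass <= d_max)
            if (length(hit) > 0) {
                adj[i, j] <- adj[j, i] <- 1
                type[i, j] <- type[j, i] <-
                    paste(transformations$name[hit], collapse = "/")
            }
        }
    }
    list(adj, type)
}

## toy input: three features and two transformations
mz <- c(f1 = 138.0317, f2 = 182.0215, f3 = 300.0845)
tf <- data.frame(name = c("Carboxylation", "Monosaccharide (-H2O)"),
    mass = c(43.9898292442, 162.0528234315))
res <- match_transformations(mz, tf, ppm = 10)
res[[1]]
```

In this toy example, f1 and f2 are linked by the carboxylation mass and f1 and f3 by the monosaccharide mass, while f2 and f3 remain unconnected.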
Before calculating the structural matrix, one must define the search space,
i.e. the transformations that will be looked for in the mass spectrometric
data, by creating the transformations object.
## define the search space for biochemical transformation
transformations <- rbind(
    c("Hydroxylation (-H)", "O", 15.9949146221, "-"),
    c("Malonyl group (-H2O)", "C3H2O3", 86.0003939305, "+"),
    c("D-ribose (-H2O) (ribosylation)", "C5H8O4", 132.0422587452, "-"),
    c("C6H10O6", "C6H10O6", 178.0477380536, "-"),
    c("Rhamnose (-H2O)", "C6H10O4", 146.057910, "-"),
    c("Monosaccharide (-H2O)", "C6H10O5", 162.0528234315, "-"),
    c("Disaccharide (-H2O) #1", "C12H20O10", 324.105649, "-"),
    c("Disaccharide (-H2O) #2", "C12H20O11", 340.1005614851, "-"),
    c("Trisaccharide (-H2O)", "C18H30O15", 486.1584702945, "-"),
    c("Glucuronic acid (-H2O)", "C6H8O6", 176.0320879894, "?"),
    c("coumaroyl (-H2O)", "C9H6O2", 146.0367794368, "?"),
    c("feruloyl (-H2O)", "C9H6O2OCH2", 176.0473441231, "?"),
    c("sinapoyl (-H2O)", "C9H6O2OCH2OCH2", 206.0579088094, "?"),
    c("putrescine to spermidine (+C3H7N)", "C3H7N", 57.0578492299, "?"))

## convert to data frame
transformations <- data.frame(
    group = transformations[, 1],
    formula = transformations[, 2],
    mass = as.numeric(transformations[, 3]),
    rt = transformations[, 4])
The function structural will then check for those
m/z differences that are stored in the column "mass" of the
object transformations. To create the adjacency matrix derived
from this structural information we enter
struct_adj <- structural(x = x_test, transformation = transformations, ppm = 10)
in the R
console.
The retention time will differ depending on the chemical group added,
e.g. the addition of a glycosyl group will
usually result in a lower retention time in reverse-phase chromatography.
This information can be used to refine the adjacency matrix derived from
the structural information. The function rtCorrection checks in an automated
fashion whether the predicted transformations correspond to the expected
retention time shifts. It requires information about the expected retention
time shift in the data.frame passed to the transformation
argument (in the "rt" column). Within this column, information about
retention time shifts is encoded by "-", "+" and "?",
which means that the feature with the higher m/z value has a lower, higher or
unknown retention time, respectively, compared to the feature with the lower
m/z value. The values for m/z and retention time will be taken from the object
passed to the x argument. In case there is a discrepancy between the
transformation and the retention time shift, the adjacency matrix
will be set to 0 at the respective position. rtCorrection will return the
updated adjacency matrix and the updated character matrix with the descriptions
of the transformations.
To account for retention time shifts we enter
struct_adj <- rtCorrection(structural = struct_adj, x = x_test, transformation = transformations)
in the R
console.
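The logic of the retention time check can be sketched in base R as follows. The helper check_rt_shift and its argument expected (a matrix holding the expected sign per link) are hypothetical simplifications introduced for this example and do not reproduce the interface of rtCorrection.

```r
## hypothetical helper, not MetNet's rtCorrection: remove links whose
## observed retention time shift contradicts the expected sign
check_rt_shift <- function(adj, mz, rt, expected) {
    for (i in seq_along(mz)) {
        for (j in seq_along(mz)) {
            ## consider existing links, feature i having the higher m/z
            if (adj[i, j] == 0 || mz[i] <= mz[j]) next
            ## "-": lower, "?": equal, "+": higher retention time
            observed <- c("-", "?", "+")[sign(rt[i] - rt[j]) + 2]
            if (expected[i, j] != "?" && observed != expected[i, j]) {
                adj[i, j] <- adj[j, i] <- 0
            }
        }
    }
    adj
}

## toy data: f2 and f3 are both linked to f1 and expected to elute earlier
mz <- c(f1 = 200, f2 = 362.0528, f3 = 215.9949)
rt <- c(f1 = 300, f2 = 250, f3 = 320)
adj <- matrix(0, 3, 3, dimnames = list(names(mz), names(mz)))
adj["f1", "f2"] <- adj["f2", "f1"] <- 1
adj["f1", "f3"] <- adj["f3", "f1"] <- 1
expected <- matrix("?", 3, 3, dimnames = dimnames(adj))
expected[adj == 1] <- "-"

adj_rt <- check_rt_shift(adj, mz, rt, expected)
adj_rt
```

The link between f1 and f2 is kept because f2 elutes earlier as expected, whereas the link between f1 and f3 is removed because f3 elutes later.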
statistical {#subsec-statistical}
The function statistical will create the adjacency
matrix based on statistical associations. The function will create a
list of weighted adjacency matrices using the statistical models defined by the
model argument. Currently, the following models are available: LASSO (using
stabs, [@Hofner2015;@Thomas2017]), Random Forest (using GENIE3),
CLR and ARACNE (the two latter using the package mpmi to calculate
mutual information with a nonparametric bias correction,
Bias Corrected Mutual Information, and the functions clr and
aracne.a from the parmigene package), Pearson and
Spearman correlation (based on the stats package), partial and semipartial
Pearson and Spearman correlation (using the ppcor package) and
score-based structure learning, which returns the strength of the probabilistic
relationships of the arcs of
a Bayesian network, as learned from bootstrapped data (using
boot.strength from the bnlearn package [@Scutari2010], with the Tabu greedy
search as default).
For further information on the different models,
take a look at the respective help pages of lasso, randomForest, clr,
aracne, correlation and/or bayes. Arguments that are accepted by the
respective underlying functions can be passed directly to the statistical
function. In addition, arguments that are defined in the functions lasso,
randomForest, clr, aracne, correlation and/or bayes
can be passed to these functions.
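To give an idea of what a list of weighted adjacency matrices looks like, the following base R sketch computes Pearson and Spearman correlation matrices for simulated intensities. It only uses stats::cor and makes no claim about how statistical post-processes the correlation values; the object names and data are made up for this example.

```r
set.seed(42)
## hypothetical intensity matrix: 5 features (rows) x 20 samples (columns)
x_sim <- matrix(rnorm(100), nrow = 5,
    dimnames = list(paste0("f", 1:5), paste0("s", 1:20)))
## let feature f2 follow feature f1 closely
x_sim["f2", ] <- x_sim["f1", ] + rnorm(20, sd = 0.1)

## cor() correlates columns, so transpose to correlate features
adj_l <- list(
    pearson = cor(t(x_sim), method = "pearson"),
    spearman = cor(t(x_sim), method = "spearman"))

## feature x feature matrices with one weight per pair of features
dim(adj_l$pearson)
round(adj_l$pearson["f1", "f2"], 2)
```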
threshold {#subsec-threshold}
From the list of weighted adjacency matrices, the function threshold
will create an unweighted adjacency matrix
that unifies the information from all statistical models.
The reasoning behind this step is to circumvent the disadvantages of the
individual models and to create a statistically reliable topology that reflects
the actual metabolic relations. threshold returns an unweighted adjacency
matrix with connections inferred from the respective models.
Four different types are implemented for creating the unweighted adjacency
matrix: threshold, top1, top2 and mean.
For type = "threshold", threshold values have to be defined in the
args argument for each respective statistical model, above or below
which each edge in each weighted adjacency matrix will be reported as an
unweighted edge. The unweighted adjacency matrices will be passed to the
consensus function from the sna package [@Butts2016]
to calculate the unweighted consensus adjacency matrix.
The arguments that are accepted by this function can be passed to the
threshold function. Furthermore, an entry threshold needs
to be defined in args that determines whether the value a~i,j~ of the consensus
adjacency matrix will be reported as a connection in the returned matrix
(if a~i,j~ $\geq$ threshold) or not. In the case of the method
"central.graph" (default), the entry threshold should be
set to 1.
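The idea behind type = "threshold" can be sketched in base R. Note that the agreement step below is a plain stand-in (an edge is kept only if all models report it) and is not the consensus function from sna; all matrices and cutoffs are hypothetical.

```r
## two hypothetical weighted adjacency matrices for three features
pearson <- matrix(c(1.00, 0.97, 0.20,
                    0.97, 1.00, 0.96,
                    0.20, 0.96, 1.00), nrow = 3, byrow = TRUE)
spearman <- matrix(c(1.00, 0.98, 0.10,
                     0.98, 1.00, 0.90,
                     0.10, 0.90, 1.00), nrow = 3, byrow = TRUE)
cutoffs <- c(pearson = 0.95, spearman = 0.95)

## binarise each weighted matrix with the cutoff of its model
bin_l <- list(
    pearson = (pearson >= cutoffs["pearson"]) * 1,
    spearman = (spearman >= cutoffs["spearman"]) * 1)

## fraction of models supporting each edge; keep edges with full agreement
support <- Reduce("+", bin_l) / length(bin_l)
consensus <- (support >= 1) * 1
diag(consensus) <- 0
consensus
```

Only the edge between the first and the second feature exceeds the cutoff in both models and is therefore retained.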
For the other three types (top1, top2, mean), the ranks per statistical model
will be calculated and, for each link, the top1, top2 or mean rank
across the statistical models will be calculated (cf. [@Hase2013]). The
top n unique ranks (defined by the entry n in args) will be returned
as links in the unweighted consensus adjacency matrix.
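The rank-based types can be illustrated with a small base R sketch. The edge weights are hypothetical, and the handling of ties and of the n best links is simplified compared to what threshold may do internally.

```r
## hypothetical edge weights of six links from two statistical models
edges <- c("f1-f2", "f1-f3", "f2-f3", "f1-f4", "f2-f4", "f3-f4")
w <- cbind(
    pearson = c(0.97, 0.20, 0.96, 0.50, 0.10, 0.80),
    spearman = c(0.98, 0.10, 0.88, 0.90, 0.20, 0.85))
rownames(w) <- edges

## rank the links within each model, rank 1 = strongest association
ranks <- apply(-w, 2, rank, ties.method = "min")
rownames(ranks) <- edges

## consensus rank per link across the models
top1 <- apply(ranks, 1, min)                      # best rank
top2 <- apply(ranks, 1, function(r) sort(r)[2])   # second-best rank
mean_rank <- rowMeans(ranks)                      # mean rank

## keep the n best-ranked links, here based on the mean rank
n <- 2
names(sort(mean_rank))[seq_len(n)]
```

With only two models, the second-best rank is identical to the worst rank of a link, which makes top2 the most conservative of the three variants in this example.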
In the following example, we will create a list of weighted adjacency matrices using Pearson and Spearman correlation, with the intensity values as input data.
x_int <- x_test[, 3:dim(x_test)[2]]
x_int <- as.matrix(x_int)
stat_adj_l <- statistical(x_int, model = c("pearson", "spearman"))
threshold
implements four types to obtain an unweighted adjacency
matrix. We will create for all types the unweighted consensus adjacency
matrices.
## type = "threshold"
args_thr <- list("pearson" = 0.95, "spearman" = 0.95, threshold = 1)
stat_adj_thr <- threshold(statistical = stat_adj_l, type = "threshold",
    args = args_thr)

## type = "top1"
args_top <- list(n = 40)
stat_adj_top1 <- threshold(statistical = stat_adj_l, type = "top1",
    args = args_top)

## type = "top2"
stat_adj_top2 <- threshold(statistical = stat_adj_l, type = "top2",
    args = args_top)

## type = "mean"
stat_adj_mean <- threshold(statistical = stat_adj_l, type = "mean",
    args = args_top)
After creating the unweighted structural and unweighted statistical adjacency
matrices, it is time to combine
these two matrices. The function combine
will combine the matrices into the consensus matrix. The function accepts
the arguments structural and statistical for the list returned
by structural and the matrix returned by threshold,
respectively, and the argument threshold, which is a numerical value
(default = 1). After adding the matrices, the function checks whether each
entry is greater than or equal to threshold and returns 1 or 0,
respectively. The argument threshold needs to be adjusted by the user
if a method other than "central.graph" is used in
threshold (type = "threshold").
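The arithmetic behind adding two unweighted matrices and applying a cutoff can be sketched in base R. The matrices and the two cutoffs are hypothetical and only illustrate how the choice of the cutoff changes the result; they make no statement about the appropriate setting for combine.

```r
## hypothetical unweighted adjacency matrices for three features
struct <- matrix(c(0, 1, 1,
                   1, 0, 0,
                   1, 0, 0), nrow = 3, byrow = TRUE)
stat <- matrix(c(0, 1, 0,
                 1, 0, 1,
                 0, 1, 0), nrow = 3, byrow = TRUE)

## entries of the sum count how many sources support a link (0, 1 or 2)
support <- struct + stat

## cutoff of 2: links supported by both sources
both <- (support >= 2) * 1
## cutoff of 1: links supported by at least one source
either <- (support >= 1) * 1

both
either
```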
We will use here the unweighted statistical adjacency matrix from
type = "mean"
:
cons_adj <- combine(structural = struct_adj, statistical = stat_adj_mean)
To display the created consensus adjacency matrix, existing visualization
tools available in the R framework can be employed, or any other visualization
tool after exporting the consensus matrix as a text file. In this example,
we will use the igraph [@Csardi2006] package to visualize the
adjacency matrix.
combine returns a list of two matrices, where the
first entry contains the unweighted adjacency matrix and the second entry
contains a matrix giving information on the type of link between features
based on structural information. Only the first entry of the list will be
passed to the graph_from_adjacency_matrix function:
g <- igraph::graph_from_adjacency_matrix(cons_adj[[1]], mode = "undirected")
plot(g, edge.width = 5, vertex.label.cex = 0.5, edge.color = "grey")
Furthermore, the network can be analysed by network analysis techniques
(topological parameters such as centrality, degree, clustering indices) that
are implemented in different packages in R
(e.g. igraph
or sna
) or other software tools outside of
the R
environment.
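As a minimal sketch of such follow-up steps, the degree of each feature can be computed directly from an unweighted adjacency matrix, and the matrix can be written to a tab-separated text file for use in external tools. The matrix below is hypothetical and the example relies on base R only.

```r
## hypothetical unweighted consensus adjacency matrix
adj <- matrix(c(0, 1, 1, 0,
                1, 0, 1, 0,
                1, 1, 0, 1,
                0, 0, 1, 0), nrow = 4, byrow = TRUE,
    dimnames = list(paste0("f", 1:4), paste0("f", 1:4)))

## degree: number of links per feature
deg <- rowSums(adj)
deg

## export as a tab-separated text file and read it back in
out_file <- tempfile(fileext = ".txt")
write.table(adj, file = out_file, sep = "\t", quote = FALSE)
adj_in <- as.matrix(read.table(out_file, sep = "\t", header = TRUE))
all(adj_in == adj)
```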
All software and respective versions to build this vignette are listed here:
sessionInfo()
The list of transformations is taken from @Breitling2006. The numerical m/z values were calculated by using the structural formula and the Biological Magnetic Resonance Data Bank web tool.
transformations <- rbind(
    c("Alanine", "C3H5NO", "71.0371137878"),
    c("Arginine", "C6H12N4O", "156.1011110281"),
    c("Asparagine", "C4H6N2O2", "114.0429274472"),
    c("Guanosine 5-diphosphate (-H2O)", "C10H13N5O10P2", "425.0137646843"),
    c("Guanosine 5-monophosphate (-H2O)", "C10H12N5O7P", "345.0474342759"),
    c("Guanine (-H)", "C5H4N5O", "150.0415847765"),
    c("Aspartic acid", "C4H5NO3", "115.0269430320"),
    c("Guanosine (-H2O)", "C10H11N5O4", "265.0811038675"),
    c("Cysteine", "C3H5NOS", "103.0091844778"),
    c("Deoxythymidine 5'-diphosphate (-H2O)", "C10H14N2O10P2", "384.01236770"),
    c("Cystine", "C6H10N2O3S2", "222.0132835777"),
    c("Thymidine (-H2O)", "C10H12N2O4", "224.0797068840"),
    c("Glutamic acid", "C5H7NO3", "129.0425930962"),
    c("Thymine (-H)", "C5H5N2O2", "125.0351024151"),
    c("Glutamine", "C5H8N2O2", "128.0585775114"),
    c("Thymidine 5'-monophosphate (-H2O)", "C10H13N2O7P", "304.0460372924"),
    c("Glycine", "C2H3NO", "57.0214637236"),
    c("Uridine 5'-diphosphate (-H2O)", "C9H12N2O11P2", "385.9916322587"),
    c("Histidine", "C6H7N3O", "137.0589118624"),
    c("Uridine 5'-monophosphate (-H2O)", "C9H11N2O8P", "306.0253018503"),
    c("Isoleucine", "C6H11NO", "113.0840639804"),
    c("Uracil (-H)", "C4H3N2O2", "111.0194523509"),
    c("Leucine", "C6H11NO", "113.0840639804"),
    c("Uridine (-H2O)", "C9H10N2O5", "226.0589714419"),
    c("Lysine", "C6H12N2O", "128.0949630177"),
    c("Acetylation (-H)", "C2H3O2", "59.0133043405"),
    c("Methionine", "C5H9NOS", "131.0404846062"),
    c("Acetylation (-H2O)", "C2H2O", "42.0105646863"),
    c("Phenylalanine", "C9H9NO", "147.0684139162"),
    c("C2H2", "C2H2", "26.0156500642"),
    c("Proline", "C5H7NO", "97.0527638520"),
    c("Carboxylation", "CO2", "43.9898292442"),
    c("Serine", "C3H5NO2", "87.0320284099"),
    c("CHO2", "CHO2", "44.9976542763"),
    c("Threonine", "C4H7NO2", "101.0476784741"),
    c("Condensation/dehydration", "H2O", "18.0105646863"),
    c("Tryptophan", "C11H10N2O", "186.0793129535"),
    c("Diphosphate", "H3O6P2", "160.9404858489"),
    c("Tyrosine", "C9H9NO2", "163.0633285383"),
    c("Ethyl addition (-H2O)", "C2H4", "28.0313001284"),
    c("Valine", "C5H9NO", "99.0684139162"),
    c("Formic Acid (-H2O)", "CO", "27.9949146221"),
    c("Acetoacetate (-H2O)", "C4H4O2", "84.0211293726"),
    c("Glyoxylate (-H2O)", "C2O2", "55.9898292442"),
    c("Acetone (-H)", "C3H5O", "57.0340397826"),
    c("Hydrogenation/dehydrogenation", "H2", "2.0156500642"),
    c("Adenylate (-H2O)", "C10H12N5O6P", "329.0525196538"),
    c("Hydroxylation (-H)", "O", "15.9949146221"),
    c("Biotinyl (-H)", "C10H15N2O3S", "243.0803380482"),
    c("Inorganic phosphate", "P", "30.9737615100"),
    c("Biotinyl (-H2O)", "C10H14N2O2S", "226.0775983940"),
    c("Ketol group (-H2O)", "C2H2O", "42.0105646863"),
    c("Carbamoyl P transfer (-H2PO4)", "CH2ON", "44.0136386915"),
    c("Methanol (-H2O)", "CH2", "14.0156500642"),
    c("Co-enzyme A (-H)", "C21H34N7O16P3S", "765.0995583014"),
    c("Phosphate", "HPO3", "79.9663304084"),
    c("Co-enzyme A (-H2O)", "C21H33N7O15P3S", "748.0968186472"),
    c("Primary amine", "NH2", "16.0187240694"),
    c("Glutathione (-H2O)", "C10H15N3O5S", "289.0732412976"),
    c("Pyrophosphate", "PP", "61.9475230200"),
    c("Isoprene addition (-H)", "C5H7", "67.0547752247"),
    c("Secondary amine", "NH", "15.0108990373"),
    c("Malonyl group (-H2O)", "C3H2O3", "86.0003939305"),
    c("Sulfate (-H2O)", "SO3", "79.9568145563"),
    c("Palmitoylation (-H2O)", "C16H30O", "238.2296655851"),
    c("Tertiary amine", "N", "14.0030740052"),
    c("Pyridoxal phosphate (-H2O)", "C8H8NO5P", "229.0140088825"),
    c("C6H10O5", "C6H10O5", "162.0528234315"),
    c("Urea addition (-H)", "CH3N2O", "59.0245377288"),
    c("C6H10O6", "C6H10O6", "178.0477380536"),
    c("Adenine (-H)", "C5H4N5", "134.0466701544"),
    c("D-ribose (-H2O) (ribosylation)", "C5H8O4", "132.0422587452"),
    c("Adenosine (-H2O)", "C10H11N5O3", "249.0861892454"),
    c("Disaccharide (-H2O) #1", "C12H20O10", "324.105649"),
    c("Disaccharide (-H2O) #2", "C12H20O11", "340.1005614851"),
    c("Adenosine 5'-diphosphate (-H2O)", "C10H13N5O9P2", "409.0188500622"),
    c("Glucose-N-phosphate (-H2O)", "C6H11O8P", "242.0191538399"),
    c("Adenosine 5'-monophosphate (-H2O)", "C10H12N5O6P", "329.0525196538"),
    c("Glucuronic acid (-H2O)", "C6H8O6", "176.0320879894"),
    c("Cytidine 5'-diphosphate (-H2O)", "C9H13N3O10P2", "385.0076166739"),
    c("Monosaccharide (-H2O)", "C6H10O5", "162.0528234315"),
    c("Cytidine 5'-monophosphate (-H2O)", "C9H12N3O7P", "305.0412862655"),
    c("Trisaccharide (-H2O)", "C18H30O15", "486.1584702945"),
    c("Cytosine (-H)", "C4H4N3O", "110.0354367661"))

transformations <- data.frame(
    name = transformations[, 1],
    formula = transformations[, 2],
    mass = as.numeric(transformations[, 3]))