BiocStyle::markdown()
Authors: Johannes Rainer
Modified: r file.info("create-compounddb.Rmd")$mtime
Compiled: r date()
Chemical compound annotation and information can be retrieved from a variety of
sources including HMDB,
LipidMaps or
ChEBI. The CompoundDb
package provides
functionality to extract data relevant for (chromatographic) peak annotations in
metabolomics/lipidomics experiments from these sources and to store it into a
common format (i.e. an CompDb
object/database). This vignette describes how
such CompDb
objects can be created exemplified with package-internal test
files that represent data subsets from some annotation resources.
The R object to represent the compound annotation is the CompDb
object. Each
object (respectively its database) is supposed to contain and provide
annotations from a single source (e.g. HMDB or LipidMaps) but it is also
possible to create cross-source databases too.
CompDb
object for HMDBThe CompDb
package provides the compound_tbl_sdf
and the
compound_tbl_lipidblast
functions to extract relevant compound annotation from
files in SDF (structure-data file) format or in the json files from LipidBlast
(http://mona.fiehnlab.ucdavis.edu/downloads). CompoundDb
allows to process SDF
files from:
Below we use the compound_tbl_sdf
to extract compound annotations from a SDF
file representing a very small subset of the HMDB database. To generate a
database for the full HMDB we would have to download the structures.sdf file
containing all metabolites and load that file instead.
library(CompoundDb) ## Locate the file hmdb_file <- system.file("sdf/HMDB_sub.sdf.gz", package = "CompoundDb") ## Extract the data cmps <- compound_tbl_sdf(hmdb_file)
The function returns by default a (data.frame
-equivalent) tibble
(from the
tidyverse's tibble
package).
cmps
The tibble
contains columns
compound_id
: the resource-specific ID of the compound.compound_name
: the name of the compound, mostly a generic or common name.inchi
: the compound's inchi.inchi_key
: the INCHI key.formula
: the chemical formula of the compound.mass
: the compounds (monoisotopic) mass.synonyms
: a list
of aliases/synonyms for the compound.smiles
: the SMILES of the compound.To create a simple compound database, we could pass this tibble
along with
additional required metadata information to the createCompDb
function. In the
present example we want to add however also MS/MS spectrum data to the
database. We thus load below the MS/MS spectra for some of the compounds from
the respective xml files downloaded from HMDB. To this end we pass the path to
the folder in which the files are located to the msms_spectra_hmdb
function. The function identifies the xml files containing MS/MS spectra based
on their their file name and loads the respective spectrum data. The folder can
therefore also contain other files, but the xml files from HMDB should not be
renamed or the function will not recognice them. Note also that at present only
MS/MS spectrum xml files from HMDB are supported (one xml file per spectrum);
these could be downloaded from HMDB with the hmdb_all_spectra.zip file.
## Locate the folder with the xml files xml_path <- system.file("xml", package = "CompoundDb") spctra <- msms_spectra_hmdb(xml_path)
At last we have to create the metadata for the resource. The metadata
information for a CompDb
resource is crucial as it defines the origin of the
annotations and its version. This information should thus be carefully defined
by the user. Below we use the make_metadata
helper function to create a
data.frame
in the expected format. The organism should be provided in the
format e.g. "Hsapiens"
for human or "Mmusculus"
for mouse, i.e. capital
first letter followed by lower case characters without whitespaces.
metad <- make_metadata(source = "HMDB", url = "http://www.hmdb.ca", source_version = "4.0", source_date = "2017-09", organism = "Hsapiens")
With all the required data ready we create the SQLite database for the HMDB
subset. With path
we specify the path to the directory in which we want to
save the database. This defaults to the current working directory, but for this
example we save the database into a temporary folder.
db_file <- createCompDb(cmps, metadata = metad, msms_spectra = spctra, path = tempdir())
The variable db_file
is now the file name of the SQLite database. We can pass
this file name to the CompDb
function to get the CompDb
objects acting as
the interface to the database.
cmpdb <- CompDb(db_file) cmpdb
To extract all compounds from the database we can use the compounds
function. The parameter columns
allows to choose the database columns to
return. Any columns for any of the database tables are supported. To get an
overview of available database tables and their columns, the tables
function
can be used:
tables(cmpdb)
Below we extract only selected columns from the compounds table.
compounds(cmpdb, columns = c("compound_name", "formula", "mass"))
Analogously we can use the Spectra
function to extract spectrum data from the
database. The function returns by default a Spectra
object from the R
Biocpkg("Spectra")
package with all spectra metadata available as spectra variables.
library(Spectra) sps <- Spectra(cmpdb) sps
The available spectra variables for the Spectra
object can be retrieved with
spectraVariables
:
spectraVariables(sps)
Individual spectra variables can be accessed with the $
operator:
sps$collision_energy
And the actual m/z and intensity values with mz
and intensity
:
mz(sps) ## m/z of the 2nd spectrum mz(sps)[[2]]
Note that it is also possible to retrieve specific spectra, e.g. for a provided
compound, or add compound annotations to the Spectra
object. Below we use the
filter expression ~ compound_id == "HMDB0000001"
to get only MS/MS spectra for
the specified compound. In addition we ask for the "compound_name"
and
"inchi_key"
of the compound.
sps <- Spectra(cmpdb, filter = ~ compound_id == "HMDB0000001", columns = c(tables(cmpdb)$msms_spectrum, "compound_name", "inchi_key")) sps
The available spectra variables:
spectraVariables(sps)
The compound's name and INCHI key have thus also been added as spectra variables:
sps$inchi_key
To share or archive the such created CompDb
database, we can also create a
dedicated R package containing the annotation. To enable reproducible research,
each CompDb
package should contain the version of the originating data source
in its file name (which is by default extracted from the metadata of the
resource). Below we create a CompDb
package from the generated database
file. Required additional information we have to provide to the function are the
package creator/maintainer and its version.
createCompDbPackage( db_file, version = "0.0.1", author = "J Rainer", path = tempdir(), maintainer = "Johannes Rainer <johannes.rainer@eurac.edu>")
The function creates a folder (in our case in a temporary directory) that can be
build and installed with R CMD build
and R CMD INSTALL
.
Special care should also be put on the license of the package that can be passed
with the license
parameter. The license of the package and how and if the
package can be distributed will depend also on the license of the originating
resource.
CompDb
object for MoNaMoNa (Massbank of North America) provides a large SDF file with all spectra
which can be used to create a CompDb
object with compound information and
MS/MS spectra. Note however that MoNa is organized by spectra and the annotation
of the compounds is not consistent and normalized. Spectra from the same
compound can have their own compound identified and data that e.g. can differ in
their chemical formula, precision of their exact mass or other fields.
Similar to the example above, compound annotations can be imported with the
compound_tbl_sdf
function while spectrum data can be imported with
msms_spectra_mona
. In the example below we use however the import_mona_sdf
that wraps both functions to reads both compound and spectrum data from a SDF
file without having to import the file twice. As an example we use a small
subset from a MoNa SDF file that contains only 7 spectra.
mona_sub <- system.file("sdf/MoNa_export-All_Spectra_sub.sdf.gz", package = "CompoundDb") mona_data <- import_mona_sdf(mona_sub)
As a result we get a list
with a data.frame each for compound and spectrum
information. These can be passed along to the createCompDb
function to create
the database (see below).
metad <- make_metadata(source = "MoNa", url = "http://mona.fiehnlab.ucdavis.edu/", source_version = "2018.11", source_date = "2018-11", organism = "Unspecified") mona_db_file <- createCompDb(mona_data$compound, metadata = metad, msms_spectra = mona_data$msms_spectrum, path = tempdir())
We can now load and use this database, e.g. by extracting all compounds as shown below.
mona <- CompDb(mona_db_file) compounds(mona)
As stated in the introduction of this section the compound
information
contains redundant information and the table has essentially one row per
spectrum. Feedback on how to reduce the redundancy in the compound table is
highly appreciated.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.