An R package for connecting to chemical and biological databases.
biodb is a framework for developing database connectors. It ships with a few local connectors (for CSV files and SQLite databases), but the main purpose of the package is to ease the development of your own connectors. Some connectors are already available in other packages (e.g.: biodbChebi, biodbHmdb, biodbKegg, biodbLipidmaps, biodbUniprot) on GitHub. For now, the targeted databases are those that store molecules, proteins, lipids and MS spectra. However, other types of databases (NMR databases, for instance) could also be targeted.
With biodb you can:
Install the latest stable version using Bioconductor:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install('biodb')
You can install the latest development version of biodb from GitHub:
install.packages('devtools')
devtools::install_github('pkrog/biodb', dependencies=TRUE)
Alongside biodb you can install the following R extension packages that use biodb for implementing connectors to online databases:
Any of these extension packages can be installed with the following command (replace 'biodbKegg' with the name of the package you want):
devtools::install_github('pkrog/biodbKegg', dependencies=TRUE)
biodb is part of Bioconda, so you can install it using Conda. This also means that it can be installed automatically in Galaxy, for a tool, if the Conda system is enabled.
The biodb package contains the following in-house database connectors:
Here are some of the fields accessible through the retrieved entries (more fields are defined in extension packages):
Here is an example of how to retrieve entries from the ChEBI database and convert them into a data frame (you must first install both the biodb and biodbChebi packages):
bdb <- biodb::newInst()
chebi <- bdb$getFactory()$createConn('chebi')
entries <- chebi$getEntry(c('2528', '7799', '15440'))
bdb$entriesToDataframe(entries)
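As a follow-up to the example above, individual field values can also be read from a single entry, and the instance can be closed once the work is done. The sketch below assumes that biodb and biodbChebi are installed and that the ChEBI web service is reachable. The getFieldValue() and terminate() calls and the 'name' field are believed to match the biodb class documentation, but check them against your installed version.

```r
# Run the example only if both packages are available (assumption: they are
# installed under these names, as described above).
havePkgs <- requireNamespace("biodb", quietly = TRUE) &&
    requireNamespace("biodbChebi", quietly = TRUE)

if (havePkgs) {
    bdb <- biodb::newInst()
    chebi <- bdb$getFactory()$createConn('chebi')

    # Requesting a single identifier is expected to return a single entry.
    entry <- chebi$getEntry('2528')

    # The entry may be NULL if the identifier is unknown or the request failed.
    if (!is.null(entry))
        print(entry$getFieldValue('name'))

    # Release the resources held by the biodb instance.
    bdb$terminate()
}
```

The guard makes the snippet safe to paste into a session where the packages are missing; in that case it does nothing.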
All compound databases (ChEBI, Compound CSV File, KEGG Compound, ...) can be searched for compounds using the same function. Once you have your connector instance, you just have to call searchCompound() on it:
myconn$searchCompound(name='phosphate')
The function will return a character vector containing all identifiers of matching entries.
It is also possible to search by mass, choosing the mass field you want (provided this particular mass field is handled by the database):
myconn$searchCompound(mass=230.02, mass.field='monoisotopic.mass', mass.tol=0.01)
Searching by both name and mass is also possible:
myconn$searchCompound(name='phosphate', mass=230.02, mass.field='monoisotopic.mass', mass.tol=0.01)
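To make the combined search more concrete, here is a minimal base R sketch of what a name and mass filter amounts to on a toy table. It is not biodb's implementation: the matchCompounds() helper and the data are hypothetical, the name is matched as a plain substring, and the tolerance is assumed to be an absolute window (mass - mass.tol to mass + mass.tol).

```r
# Toy compound table (illustrative values, not taken from any database).
compounds <- data.frame(
    accession = c('A1', 'A2', 'A3'),
    name = c('glucose phosphate', 'phosphate ester', 'citrate'),
    monoisotopic.mass = c(230.019, 230.05, 192.027),
    stringsAsFactors = FALSE
)

# Hypothetical helper: returns the identifiers of the rows whose name contains
# `name` and whose mass lies within `mass` +/- `mass.tol`.
matchCompounds <- function(df, name = NULL, mass = NULL, mass.tol = 0.01) {
    keep <- rep(TRUE, nrow(df))
    if (!is.null(name))
        keep <- keep & grepl(name, df$name, fixed = TRUE)
    if (!is.null(mass))
        keep <- keep & abs(df$monoisotopic.mass - mass) <= mass.tol
    df$accession[keep]
}

byName <- matchCompounds(compounds, name = 'phosphate')
byBoth <- matchCompounds(compounds, name = 'phosphate', mass = 230.02,
                         mass.tol = 0.01)
print(byName)
print(byBoth)
```

Adding the mass criterion narrows the result: both criteria must hold for an entry to be returned.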
All mass spectra databases (Mass CSV File and Mass SQLite) can be searched for mass spectra using the same function, searchMsEntries():
myconn$searchMsEntries(mz.min=40, mz.max=41)
The function will return a character vector containing all identifiers of matching entries (i.e.: spectra containing at least one peak inside this M/Z range).
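The selection rule described above (a spectrum matches if at least one of its peaks falls inside the M/Z range) can be illustrated with a few lines of base R. This is only a sketch on made-up data: the hasPeakInRange() helper is hypothetical, and treating both bounds as inclusive is an assumption.

```r
# Toy spectra: each element is the vector of peak M/Z values of one spectrum.
spectra <- list(
    S1 = c(39.5, 40.2, 77.0),
    S2 = c(41.5, 120.1),
    S3 = c(40.9)
)

# Hypothetical helper: TRUE if at least one peak lies inside [mz.min, mz.max].
hasPeakInRange <- function(mz, mz.min, mz.max) {
    any(mz >= mz.min & mz <= mz.max)
}

selected <- vapply(spectra, hasPeakInRange, logical(1),
                   mz.min = 40, mz.max = 41)
ids <- names(spectra)[selected]
print(ids)
```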
Annotating a mass spectrum can be done either using a mass spectra database or a compound database.
When using a mass spectra database, the function to call is searchMsPeaks():
myMassConn$searchMsPeaks(myInputDataFrame, mz.tol=0.1, mz.tol.unit='plain', ms.mode='pos')
It returns a new data frame containing the annotations.
When using a compound database, the function to call is annotateMzValues():
myCompoundConn$annotateMzValues(myInputDataFrame, mz.tol=0.1, mz.tol.unit='plain', ms.mode='neg')
It returns a new data frame containing the annotations.
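The idea behind annotating M/Z values with a compound database can be sketched in base R as follows. This is an illustration under stated assumptions, not biodb's actual algorithm: the input column is assumed to be named 'mz', only singly charged [M+H]+ and [M-H]- ions are considered, the 'plain' unit is taken to mean an absolute tolerance, and annotateSketch() and the toy data are hypothetical.

```r
# Input data frame: one row per measured peak (the 'mz' column name is an
# assumption).
input <- data.frame(mz = c(181.0707, 300.0000))

# Toy compound table (illustrative values).
compounds <- data.frame(
    accession = c('C1', 'C2'),
    monoisotopic.mass = c(180.0634, 250.0000),
    stringsAsFactors = FALSE
)

protonMass <- 1.007276  # Approximate mass of a proton, in Da.

# Hypothetical helper: converts each M/Z value to a neutral mass, then looks
# for compounds whose mass lies within +/- mz.tol of it.
annotateSketch <- function(input, compounds, mz.tol, ms.mode) {
    # [M+H]+ in positive mode, [M-H]- in negative mode.
    shift <- if (ms.mode == 'pos') protonMass else -protonMass
    rows <- lapply(seq_len(nrow(input)), function(i) {
        neutralMass <- input$mz[i] - shift
        hit <- abs(compounds$monoisotopic.mass - neutralMass) <= mz.tol
        if (any(hit)) {
            data.frame(mz = input$mz[i],
                       accession = compounds$accession[hit],
                       stringsAsFactors = FALSE)
        } else {
            # Keep unmatched peaks, with a missing annotation.
            data.frame(mz = input$mz[i], accession = NA_character_,
                       stringsAsFactors = FALSE)
        }
    })
    do.call(rbind, rows)
}

annot <- annotateSketch(input, compounds, mz.tol = 0.1, ms.mode = 'pos')
print(annot)
```

The output keeps one row per match, so a peak that matches several compounds would appear several times, and a peak without any match is kept with a missing accession.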
Defining a new field for a database is done in two steps, using definitions written inside a YAML file.
First we define the new field. Here we define the ChEBI database field for the stars indicator (a curation quality indicator):
fields:
  n_stars:
    description: The ChEBI example stars indicator.
    class: integer
Then we define the parsing expression that the ChEBI connector will use to parse the field's value:
databases:
  chebi:
    parsing.expr:
      n_stars: //chebi:return/chebi:entityStar
We now just have to load the YAML definition file into biodb (in extension packages, this is done automatically):
mybiodb$loadDefinitions('my_definitions.yml')
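Since both definitions live in the same YAML file, the file can be assembled in one go. The sketch below writes the two fragments shown above into a temporary file with their nesting made explicit. The file location is arbitrary, and the optional check relies on the yaml package, which is an assumption about your setup and is skipped if the package is missing.

```r
# Write both definitions (field and parsing expression) into a single file.
defFile <- file.path(tempdir(), 'my_definitions.yml')
writeLines(c(
    'fields:',
    '  n_stars:',
    '    description: The ChEBI example stars indicator.',
    '    class: integer',
    '',
    'databases:',
    '  chebi:',
    '    parsing.expr:',
    '      n_stars: //chebi:return/chebi:entityStar'
), defFile)

# Optional sanity check of the YAML structure, if the yaml package is
# installed.
if (requireNamespace('yaml', quietly = TRUE)) {
    defs <- yaml::read_yaml(defFile)
    print(defs$fields$n_stars$class)
}

# The file can then be passed to loadDefinitions() on your biodb instance:
# mybiodb$loadDefinitions(defFile)
```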
Parsing may be more complex for some fields or databases. In that case it is possible to write specific code in the database entry class for parsing these fields.
Defining a new connector is done by writing two RC classes and a YAML definition:
* An RC class for the connector, named MyDatabaseConn.R.
* An RC class for the entry, named MyDatabaseEntry.R.
* A definition YAML file containing metadata about the new connector, like:
+ The URLs (main URL, web service base URL, etc.) for a remote database.
+ The timing for querying a remote database (maximum number of requests per second).
+ The name.
+ The parsing expressions used for parsing the entry fields.
+ The type of content retrieved from the database when downloading an entry (plain text, XML, HTML, JSON, ...).
For a good starting example of defining a new remote connector, see biodbChebi, the ChEBI extension for biodb, at https://github.com/pkrog/biodbChebi. In particular:
* The connector class.
* The entry class.
* The definitions file.
biodb provides a set of classes and methods to generate the skeleton of
a new repository for a new connector. The easiest way to use this feature is
through the function biodb::genNewExtPkg().
Here is a commented example that creates a new repository for a connector to
the Foo remote database:
biodb::genNewExtPkg(
path = 'the/path/to/biodbFoo', # The repository folder.
# pkgName = 'myName', # By default the last folder of `path` is used
# so you do not need to modify it.
email = 'your@e.mail', # The author's email.
dbName = 'foo.db', # The connector name that will be used by biodb.
dbTitle = 'Foo database', # A short description of the connector's database.
# pkgLicense = '...', # The generated license is always AGPL-3.
firstname = 'Your firstname',
lastname = 'Your lastname',
connType = 'compound', # Use 'mass' for an MS database or 'plain' for any
# other type. Run `biodb::getConnTypes()` to get a
# full list of all available types.
entryType = 'txt', # Other possible types are: 'plain', 'csv',
# 'html', 'json', 'list', 'sdf' and 'xml'.
# Run `biodb::getEntryTypes()` to get a full list
# of all available types.
editable = FALSE, # If the database is editable in memory.
writable = FALSE, # If the database is writable on disk (like a CSV
# file).
remote = TRUE, # If the database is accessed through a web protocol
# like HTTPS, as opposed to a local database stored
# inside an SQLite file or a CSV file.
downloadable = FALSE, # Set it to TRUE for a remote database that allows
# the download of its full content (e.g.: through
# the download of a zip file).
makefile = TRUE, # Generate a Makefile, useful for maintenance on
# UNIX/Linux systems.
rcpp = FALSE, # If set to TRUE, the package will be configured
# to use Rcpp and skeleton files will be generated
# with examples and test examples.
# vignetteName = '...', # By default the vignette name will be the package
# name.
githubRepos = 'id/repos' # The repository URL on GitHub (e.g.:
# 'pkrog/biodbChebi').
)
Once in R, you can get an introduction to the package with:
?biodb
Then each class has its own documentation. For instance, to get help about the
BiodbFactory class:
?biodb::BiodbFactory
Several vignettes are also available. To get a list of them run:
vignette(package='biodb')
To open a vignette in a browser, use its name:
vignette('new_connector', package='biodb')
If you wish to contribute to the biodb package, you first need to create an account on GitHub. You can then either ask to become a contributor or fork the project and submit a pull request.
Bug fixes, enhancements, and new database connectors or entry parsers are of course most welcome.