biodbUniprot is a biodb extension package that implements a connector to the UniProt database.
The UniProt Knowledge Base [@uniprotConsortium2016UniProtKB] can be searched using its search web service.
This vignette shows how to query that web service with this package.
Install using Bioconductor:
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install('biodbUniprot')
The first step in using biodbUniprot is to create an instance of the biodb class BiodbMain from the main biodb package. This is done by calling the constructor of the class:
mybiodb <- biodb::newInst()
During this step the configuration is set up, the cache system is initialized and extension packages are loaded.
We will see at the end of this vignette that the biodb instance needs to be terminated with a call to the terminate() method.
In biodb the connection to a database is handled by a connector instance that you can get from the factory. biodbUniprot implements a connector to a remote database. Here is the code to instantiate a connector:
conn <- mybiodb$getFactory()$createConn('uniprot')
To download entries, call the getEntry() method, which returns a list of BiodbEntry objects:
entries <- conn$getEntry(c('P01011', 'P09237'))
To print the information contained in the entry objects as a data frame, run the entriesToDataframe() method attached to the BiodbMain instance:
mybiodb$entriesToDataframe(entries)
The method wsSearch() (wsQuery() is now deprecated) implements the request to the search web service and the parsing of its output.
To get the raw results returned by the UniProt server, run the following code:
conn$wsSearch('reviewed:true AND organism_id:9606', fields=c('accession', 'id'), size=2, retfmt='plain')
The first parameter is the query, as required by the web service. To learn how to write a query for UniProt, see a description of the query web service at http://www.uniprot.org/help/api_queries.
The fields parameter lists the fields you want returned for each entry.
The size parameter is the maximum number of entries the server may return.
The retfmt parameter controls the type of output desired.
Here "plain" states that we want the raw output from the server.
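The query passed as the first parameter can also be assembled from individual field:value terms rather than typed by hand. The following is a minimal sketch in base R; buildQuery() is a hypothetical helper, not part of biodb or biodbUniprot, and it only covers terms joined with AND.

```r
# Hypothetical helper (not part of biodb): joins named terms into a
# UniProt-style query string of the form 'field:value AND field:value'.
buildQuery <- function(terms) {
    paste(paste0(names(terms), ':', terms), collapse=' AND ')
}

query <- buildQuery(c(reviewed='true', organism_id='9606'))
print(query)
# [1] "reviewed:true AND organism_id:9606"
```

The resulting string can then be passed to wsSearch() in place of the literal query used in this vignette.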
To get the output parsed by biodb and get a data frame, run:
conn$wsSearch('reviewed:true AND organism_id:9606', fields=c('accession', 'id'), size=2, retfmt='parsed')
To get only the list of UniProt identifiers, run:
conn$wsSearch('reviewed:true AND organism_id:9606', fields=c('accession', 'id'), size=2, retfmt='ids')
And if you are curious to see the URL request that is sent to the server, run:
conn$wsSearch('reviewed:true AND organism_id:9606', fields=c('accession', 'id'), size=2, retfmt='request')
The method geneSymbolToUniprotIds() uses wsSearch() to search for UniProt entries that reference particular gene symbols.
For instance, if you want to get the UniProt entries that have the gene symbol G-CSF, just run:
ids <- conn$geneSymbolToUniprotIds('G-CSF')
mybiodb$entryIdsToDataframe(ids[['G-CSF']], 'uniprot', fields=c('accession', 'gene.symbol'))
If you also want to match GCSF (without the minus sign character), run:
ids <- conn$geneSymbolToUniprotIds('G-CSF', ignore.nonalphanum=TRUE)
mybiodb$entryIdsToDataframe(ids[['G-CSF']], 'uniprot', fields=c('accession', 'gene.symbol'))
If you want to match G-CSFa2 too, run:
ids <- conn$geneSymbolToUniprotIds('G-CSF', partial.match=TRUE)
mybiodb$entryIdsToDataframe(ids[['G-CSF']], 'uniprot', fields=c('accession', 'gene.symbol'))
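To make the effect of the two options concrete, here is a small base R sketch of the matching rules as described above. matchSymbol() is a hypothetical function written for illustration only; the actual filtering inside biodbUniprot may differ (for example in how it handles letter case).

```r
# Hypothetical illustration (not biodb's implementation) of the two
# matching options applied to a vector of candidate gene symbols.
matchSymbol <- function(candidates, symbol, ignore.nonalphanum=FALSE,
                        partial.match=FALSE) {
    normalize <- function(x) {
        # Optionally strip every character that is not a letter or a digit
        if (ignore.nonalphanum) x <- gsub('[^A-Za-z0-9]', '', x)
        x
    }
    cand <- normalize(candidates)
    sym <- normalize(symbol)
    if (partial.match) {
        grepl(sym, cand, fixed=TRUE)  # symbol may appear anywhere
    } else {
        cand == sym                   # whole symbol must match
    }
}

candidates <- c('G-CSF', 'GCSF', 'G-CSFa2', 'CSF3')
exact <- matchSymbol(candidates, 'G-CSF')
loose <- matchSymbol(candidates, 'G-CSF', ignore.nonalphanum=TRUE)
partial <- matchSymbol(candidates, 'G-CSF', partial.match=TRUE)
```

Here exact selects only G-CSF, loose also selects GCSF, and partial selects G-CSF and G-CSFa2.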
The geneSymbolToUniprotIds() method works by running wsSearch() to get a first set of entry identifiers, then downloading each entry and filtering the results.
Downloading the entries may take quite long, since wsSearch() potentially returns thousands of entries, each entry is downloaded with a separate request, and the request frequency is limited to 3 requests per second.
Entries already in cache or memory will not be downloaded again, so running the same request a second time will be faster, as is usually the case with biodb.
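As a rough order of magnitude, the frequency limit gives a lower bound on the download time. The sketch below is plain arithmetic in base R; estimateDownloadSeconds() is a hypothetical helper, and the estimate ignores network latency and server response time.

```r
# Hypothetical helper: lower bound on download time, assuming one request
# per entry not yet cached and at most max.freq requests per second.
estimateDownloadSeconds <- function(n.entries, n.cached=0, max.freq=3) {
    n.requests <- max(n.entries - n.cached, 0)
    n.requests / max.freq
}

first.run <- estimateDownloadSeconds(3000)                 # 1000 seconds
second.run <- estimateDownloadSeconds(3000, n.cached=3000) # 0 seconds
```

For 3000 uncached entries, the first run needs at least 1000 seconds (nearly 17 minutes), while a second run with everything in cache issues no request at all.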
When done with your biodb instance, you have to terminate it in order to ensure the release of resources (file handles, database connections, etc.):
mybiodb$terminate()
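In a script or a function, one way to make sure terminate() is always called, even when an error occurs, is R's on.exit() mechanism. The following is a sketch under stated assumptions: withBiodb() is a hypothetical wrapper, not part of biodb, and its newInst argument exists only so that a stand-in constructor can be supplied.

```r
# Hypothetical wrapper (not part of biodb): creates an instance, runs fun
# on it, and terminates the instance whether or not fun fails.
withBiodb <- function(fun, newInst=biodb::newInst) {
    mybiodb <- newInst()
    on.exit(mybiodb$terminate(), add=TRUE)
    fun(mybiodb)
}

# Usage, assuming biodb and biodbUniprot are installed:
# withBiodb(function(mybiodb) {
#     conn <- mybiodb$getFactory()$createConn('uniprot')
#     entries <- conn$getEntry(c('P01011', 'P09237'))
#     mybiodb$entriesToDataframe(entries)
# })
```

The wrapper returns whatever fun returns, so the data frame is built before the instance is terminated.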
sessionInfo()