BiocStyle::markdown()
Package: Pbase
Author: Johannes Rainer, Laurent Gatto
Last compiled: r date()
Last modified: r file.info("Pbase-with-ensembldb.Rmd")$mtime
library("Pbase") library("ensembldb")
This vignette describes how the r Biocpkg("ensembldb")
package can be used to
retrieve protein annotations and to map peptides within protein sequences to the
genomic coordinates. EnsDb
databases and packages created by the ensembldb
package version > 1.99.0 can provide protein annotations, but don't necessarily
have too. Only EnsDb
packages created using the Ensembl Perl API contain
protein annotations, while databases created from GTF or GFF files or from
GRanges
objects or using the r Biocpkg("AnnotationHub")
don't.
Below we load the EnsDb
object with for all human genes defined in the Ensembl
database version 86 and use the hasProteinData
method to evaluate whether
protein annotations are available.
library(ensembldb) library(EnsDb.Hsapiens.v86) ## Make a shortcut to the object edb <- EnsDb.Hsapiens.v86 hasProteinData(edb)
ensembldb
provides the proteins
method to fetch protein annotations from an
EnsDb
database (along eventual transcript or gene annotations) and, depending
on the value of the return.type
parameter, return the results as a
data.frame
, DataFrame
or AAStringSet
. Below we use the function to
retrieve all proteins encoded by the gene ZBTB16. We use a GeneNameFilter
to
select entries for that gene. In the simple example below only protein
annotations were retrieved (i.e. the Ensembl protein ID, the amino acid
sequence, the ID of the transcript encoding the protein and the gene name; the
latter two stored in the AAStringSet
's mcols
), but EnsDb
s provide in
addition also the mapping between Ensembl protein IDs and Uniprot IDs and all
protein domains within the protein sequence. For more information on available
filters, methods and annotation columns see the ensembldb
vignette.
prts <- proteins(edb, filter = GeneNameFilter("ZBTB16"), return.type = "AAStringSet") prts ## Get access to the additional annotation data mcols(prts)
In Pbase
, a Proteins
method is also implemented for EnsDb
objects that
enables to load a Proteins
object from an EnsDb
database and optionally load
all protein domains as peptide ranges in the pranges
slot (using the
loadProteinDomains
parameter which is by default TRUE
). Below we use also a
filter expression in form of a formula
instead of an AnnotationFilter
class.
library(Pbase) zbtb16 <- Proteins(edb, filter = ~ gene_name == "ZBTB16")
The proteins' sequences are stored in the aa
slot and the protein domains in
the pranges
slot:
## Get the protein sequences. aa(zbtb16) ## Get the protein domains stored as peptide features. pranges(zbtb16)
Also the Proteins
method supports to retrieve additional annotation columns
from the database. Below we repeat the call but fetch in addition also the
Uniprot IDs for the proteins. With that a property of the provided annotation
becomes apparent: while the mapping between Ensembl transcript ID and Ensembl
protein ID is always 1:1, each Ensembl protein can be annotated to more than one
Uniprot ID and each Uniprot ID can be assigned one or more Ensembl protein IDs.
## Loading in addition Uniprot ID annotations zbtb16_2 <- Proteins(edb, columns = c("uniprot_id", "uniprot_db", "uniprot_mapping_type"), filter = GeneNameFilter("ZBTB16")) acols(zbtb16_2)
As we can see two transcript IDs (ENST00000335953 and ENST00000392996) and hence
Ensembl protein IDs are each annotated to two Uniprot IDs. To avoid some of
these multi-mappings, we can use an UniprotMappingTypeFilter
to retrieve only
presumably higher quality annotations by selecting only Uniprot IDs mapped to
Ensembl protein IDs using the "DIRECT"
mapping type. We get hence again unique
Ensembl protein IDs respectively transcript IDs.
acols(Proteins(edb, columns = c("uniprot_id", "uniprot_db"), filter = AnnotationFilterList( GeneNameFilter("ZBTB16"), UniprotMappingTypeFilter("DIRECT"))))
Note that without specifying a filter we can also retrieve all proteins from the database.
EnsDb
databases provide both, protein annotations and gene/transcript
annotations including their genomic coordinates. Peptide features along amino
acid sequences within a Proteins
objects can hence be mapped to the genome
directly using the mapToGenome
method by providing an EnsDb
instance with
the genome
parameter. The Proteins
object has to provide IDs that allow to
identify the encoding transcripts. Supported IDs are Ensembl protein ID (idType
= "protein_id"
), transcript ID (idType = "tx_id"
) or Uniprot ID (idType =
"uniprot_id"
) which can be provided either as the names
of the Proteins
object, or in one of the acols
metadata columns.
In the example below we use the Proteins
object with all proteins of the gene
ZBTB16 that we fetched from the database in the previous section. To identify
the encoding transcripts, we use the Ensembl protein IDs that are provided with
the names of the object.
## We use the Ensembl protein IDs to identify the transcripts in the database names(zbtb16) ## Map all peptide features within the object to the genome. zbtb16_map <- mapToGenome(zbtb16, edb, idType = "protein_id", id = "name") zbtb16_map
The result is a GRangesList
one element for each protein with GRanges
for
the genomic positions of the mapped peptide features (in our example protein
domains within the protein sequence). Below we plot the genomic alignments for
the protein domains of the first protein.
library(Gviz) ## Define a genome axis track gat <- GenomeAxisTrack() ## Get the transcript ID of the first transcript: txid <- acols(zbtb16)$tx_id[1] ## Get a GRanges for the first transcript trt <- getGeneRegionTrackForGviz(edb, filter = TxIdFilter(txid)) ## Add the protein domain name as column "id" to the mcols so it ## can be used in the plot to identify the features. map_1 <- zbtb16_map[[1]] map_1$id <- names(map_1) ## Plotting the transcript and the mapped protein domains. plotTracks(list(gat, GeneRegionTrack(trt, name = "tx"), AnnotationTrack(map_1, groupAnnotation = "id", just.group = "above", name = "Protein domains")), transcriptAnnotation = "transcript")
While the mapToGenome,Proteins,EnsDb
method used above performs all of the
mapping and supports mapping of multiple proteins, we could also fetch the
genomic positions of the encoding transcripts' CDS using the cdsBy
method and
provide the resulting GRangesList
to the mapToGenome,Proteins,GRangesList
method.
In addition to Ensembl transcript or protein IDs, it is also possible to provide
Uniprot IDs. While the mapping between Ensembl protein IDs and transcript IDs is
1:1, multiple Ensembl protein IDs can be assigned to a single Uniprot ID hence
the mappoing between Uniprot IDs and Ensembl transcript IDs is also 1:n. In
cases in which a Uniprot ID is annotated to multiple transcript IDs, the
mapToGenome
method will select automatically the best suited transcript by
comparing the length of the protein sequence with the length of the transcripts'
coding sequences. In the example below we use the test Proteins
data from the
Pbase
package and perform the mapping of the peptide features using the
provided Uniprot IDs of the proteins.
data(p) ## We use the Uniprot IDs provided as names for the mapping names(p) res <- mapToGenome(p, edb, idType = "uniprot_id")
Apart from one of the Uniprot IDs, P04075-2, for which no corresponding transcript could be identified, all peptide features could be mapped successfully. As detailed above, the method did select for each Uniprot ID the transcript with the best matching CDS. To illustrate this, we select below all transcripts that are annotated to the first Uniprot ID.
txs <- transcripts(edb, filter = UniprotFilter(names(p)[1])) names(txs)
While there are 6 transcripts annotated to the Uniprot ID from only one the coding sequence matches the length of the protein:
## Fetch the CDS for each transcript cdss <- cdsBy(edb, filter = TxIdFilter(names(txs))) ## The protein sequence length: width(aa(p)[1]) ## The difference between protein sequence and CDS length width(aa(p)[1]) - sum(width(cdss)) / 3
Thus, the lenght of the coding region of the last transcript matches the length of the protein's amino acid sequence (the stop codon, being part of the CDS, is not coding for an amino acid).
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.