BiocStyle::markdown()
Package: Pbase
Author: Laurent Gatto
Last compiled: r date()
Last modified: r file.info("mapping.Rmd")$mtime
library("Pbase")
The aim of this vignette is to document the mapping of proteins and the tandem mass spectrometry-derived peptides to genomic locations.
Pbase:::mapplot()
We will use a small Proteins
object from the r Biocpkg("Pbase")
package to illustrate how to retrieve genome coordinates and map a
peptides back to genomic coordinates. See the Pbase-data
vignette
for an introduction to Proteins
data. The main information needed in
this vignette consists of protein UniProt identifiers and a set of
peptides positions along the protein sequence.
library("Pbase") data(p) p seqnames(p)
pcols(p)[1, c("start", "end")]
We will also require an identifier relating the protein feature of
interest to the genome. Below, we use r Biocpkg("biomaRt")
to query
the matching Ensembl transcript identifier. We start by create a
Mart
object that stores the connection details to the latest human
Ensembl biomart server.
library("biomaRt") ens <- useMart("ensembl", "hsapiens_gene_ensembl") ens bm <- select(ens, keys = seqnames(p), keytype = "uniprot_swissprot", columns = c( "uniprot_swissprot", "hgnc_symbol", "ensembl_transcript_id")) bm
As can be seen, there can be multiple transcripts for one protein
accession. We have defined the transcripts of interest for our
proteins in p
; they are stored as protein elements metadata:
acols(p)$ENST
The etrid2grl
function takes our transcript identifiers and will
query the Ensembl biomart server (note the ens
argument) and return
a GRangesList
object. For each of Ensembl transcript identifiers
provided as input, we have the genomic coordinates of that
transcript's exons as well as additional information such as the type
of exons (protein coding or untranslated region).
grl <- etrid2grl(acols(p)$ENST, ens) all.equal(names(grl), acols(p)$ENST, check.attributes=FALSE) grl
We also need to retain only coding exons and discard untranslated
regions for later peptide mapping, using the proteinCoding
function.
pcgrl <- proteinCoding(grl) pcgrl
r Biocpkg("Gviz")
and r Biocpkg("Pviz")
pp <- p[5] n <- elementNROWS(pranges(pp)) pepstring <- paste(unique(pcols(pp)[[1]]$pepseq), collapse = ", ")
The peptides that have been experimentally observed are available as
ranges (coordinates) along the protein sequences. For example, below,
we isolate and visualise the r n
peptides (r pepstring
) have been
identified for our protein of interest r seqnames(p)[5]
.
sort(pranges(p)[5]) plot(p[5])
We can also plot the transcript regions inluding (grl
) or exclusing
(pcgrl
) the untranslated regions.
plotAsGeneRegionTrack(grl[[5]], pcgrl[[5]])
The aim of this document is to document the mapping of peptides, i.e. ranges along a protein sequence to ranges along the genome reference. In other words, our aim is the convert protein coordinates to genome coordinates.
The first check that we want to implement is to verify that we can regenerate the protein amino acid sequence from the genome regions that we have extracted.
We also need the actual genome sequence (so far, we have only dealt
with regions and features). The exons coordinates have been retrieved
from the latest Ensembl release, which is based on the human genome
assembly GRCh38
. We will use a genome package that is based on the
same reference genome, namely r Biocannopkg("BSgenome.Hsapiens.NCBI.GRCh38")
.
We need to make sure that the chromosomes are named the same way in
the genome sequence data and our genomics ranges ("chrX"
, as seen
above).
library("BSgenome.Hsapiens.NCBI.GRCh38") head(seqnames(BSgenome.Hsapiens.NCBI.GRCh38)) if (!"chr1" %in% seqnames(BSgenome.Hsapiens.NCBI.GRCh38)) seqnames(BSgenome.Hsapiens.NCBI.GRCh38)[1:23] <- paste0("chr", seqnames(BSgenome.Hsapiens.NCBI.GRCh38)[1:23]) seqnames(BSgenome.Hsapiens.NCBI.GRCh38)[21:27]
Once we have extracted the actual sequences, we must also make sure
that we we reverse the sequences in case out genomic features are on
the reverse strand. We the combine (unlist
) the exons (coding
sequences only, pcgrl
) and translate then into a protein sequence.
s <- getSeq(BSgenome.Hsapiens.NCBI.GRCh38, pcgrl[[5]]) s if (isReverse(pcgrl[[5]])) s <- rev(s) aaseq <- translate(unlist(s)) aaseq
We verify that the translated genome sequence and the protein squence we started with match by aligning them.
writePairwiseAlignments(pairwiseAlignment(aa(p[5]), aaseq))
We can now calculate the peptide coordinate along the genome using the
position of the peptides along the protein (in p
) and the position
of the exons of the protein's transcript along the genome (in pcgrl
)
using the mapToGenome
function.
res <- mapToGenome(p[5], pcgrl[5]) res[[1]]
Based on the new peptide genomic coordinates, it is now
straightforward to create a new AnnotationTrack
and add it the the
track visualisation.
plotAsAnnotationTrack(res[[1]], grl[[5]])
Finally, we customise the figure by adding a track with the $MS^2$
spectra. The raw data used to search the protein database an create
p
are available as an MSnExp
object.
data(pms) library("ggplot2") details <- function(identifier, ...) { p <- plot(pms[[as.numeric(identifier)]], full=TRUE, plot=FALSE) + ggtitle("") p <- p + theme_bw() + theme(axis.text.y = element_blank(), axis.text.x = element_blank()) + labs(x = NULL, y = NULL) print(p, newpage=FALSE) } res <- res[[1]] deTrack <- AnnotationTrack(start = start(res), end = end(res), genome = "hg38", chromosom = "chrX", id = pcols(p)[[5]]$acquisitionNum, name = "MS2 spectra", stacking = "squish", fun = details) grTrack <- GeneRegionTrack(grl[[5]], name = acols(p)$ENST[5]) ideoTrack <- IdeogramTrack(genome = "hg38", chromosome = "chrX") axisTrack <- GenomeAxisTrack() plotTracks(list(ideoTrack, axisTrack, deTrack, grTrack), add53 = TRUE, add35 = TRUE)
Proteins
objectAbove, we have demonstrated the mapToGenome
functionality on one
protein only. The same operation can be performed on all the r length(p)
of the p
object using the pmapToGenome
equivalent using
the r length(pcgrl)
ranges calculated with etrid2grl
. The pmapToGenome
will map the peptides of i-th protein to the i-th genomic location of pcgrl
.
pres <- pmapToGenome(p, pcgrl) pres
k <- 6 seqnames(p)[k]
In the code chunk below, we remind ourselves that, querying the
Ensembl Biomart server for r seqnames(p)[k]
, we obtain several
possible transcript identifiers, including the identifier of interest
r acols(p)$ENST[k]
.
sel <- bm$uniprot_swissprot == seqnames(p)[k] bm[sel, ] acols(p)$ENST[k]
Let's fetch the coordinates of all possible transcipts, making sure
that the names of the Ensembl identifiers are used to name the grl
ranges (using use.names = TRUE
).
eid <- bm[sel, 3] names(eid) <- bm[sel, 1] eid grl <- etrid2grl(eid, ens, use.names = TRUE) pcgrl <- proteinCoding(grl)
plotAsGeneRegionTrack(pcgrl)
We extract the transcript sequences, translate them into protein
sequences and align each to our protein sequence (originally imported
from the fasta database, see ?Proteins
for the construction of p
).
lseq <- lapply(getSeq(BSgenome.Hsapiens.NCBI.GRCh38, pcgrl), function(s) translate(unlist(s))) laln <- sapply(lseq, pairwiseAlignment, aa(p[k])) sapply(laln, nmatch)/width(aa(p[k]))
ki <- which.max(sapply(laln, nmatch))
We see that transcript number r ki
, r eid[ki]
, perfectly aligns
with our protein sequence. This is also the transcipt that corresponds
to the curated Ensembl transcript in acols(p)$ENST
.
ki <- which.max(sapply(laln, nmatch)) stopifnot(eid[ki] == acols(p)$ENST[k])
res <- pmapToGenome(p[k], pcgrl[ki])
As shown on the next figure, peptides that span over exon junctions are grouped together and, below, colour-coded.
plotAsAnnotationTrack(res[[1]], pcgrl[[ki]])
One can also apply a many-to-one mapping approach to all proteins in
the p
object and all the transcripts identifiers fetched with
etrid2grl
as shown below.
alleid <- bm[, 3] names(alleid) <- bm[, 1] grl <- etrid2grl(alleid, ens, use.names = TRUE) pcgrl <- proteinCoding(grl) res <- mapToGenome(p, pcgrl) length(res)
The messages indicate that one protein accession number was not found
in the pcgrl
ranges (no transcript was found) and several mapping
failed. In total, we obtain r length(res)
mapping for
r length(unique(names(pcgrl)))
protein accession numbers.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.