BiocStyle::markdown()
Package: r Biocpkg("PSMatch")
Authors: r packageDescription("PSMatch")[["Author"]]
Last modified: r file.info("PSM.Rmd")$mtime
Compiled: r date()
library("PSMatch")
To install the package from Bioconductor, make sure you have the
BiocManager
package, available from CRAN, and then run
BiocManager::install("PSMatch")
This vignette is one among several illustrating how to use the
PSMatch
package, focusing on the handling and processing of
proteomics identification data using the PSM
class. For a general
overview of the package, see the PSMatch
package manual page
(?PSMatch
) and references therein.
We are going to use an mzid
file from the msdata
package.
f <- msdata::ident(full.names = TRUE, pattern = "TMT") basename(f)
The PSM()
function parses one or multiple mzid
files and returns
an object of class PSM
. This class is a simple extension of the
DFrame
class, representing the peptide-spectrum matches in a tabular
format.
library("PSMatch") id <- PSM(f) id
n_matches <- nrow(id) n_scans <- length(unique(id$spectrumID)) n_seqs <- length(unique(id$sequence)) n_cols <- ncol(id)
This table contains r n_matches
matches for r n_scans
scans and
r n_seqs
peptides sequences, each annotated by r n_cols
variables.
nrow(id) ## number of matches length(unique(id$spectrumID)) ## number of scans length(unique(id$sequence)) ## number of peptide sequences names(id)
The PSM data are read as is, without any filtering. As we can see below, we still have all the hits from the forward and reverse (decoy) databases.
table(id$isDecoy)
The data also contains multiple matches for several spectra. The table below shows the number of individual MS scans that have 1, 2, ... up to 5 matches.
table(table(id$spectrumID))
More specifically, we can see below how scan 1774 has 4 matches, all
to sequence RTRYQAEVR
, which itself matches to 4 different proteins:
i <- grep("scan=1774", id$spectrumID) id[i, ] id[i, "DatabaseAccess"]
If the goal is to keep all the matches, but arranged by scan/spectrum,
one can reduce the DataFrame
object by the spectrumID
variable,
so that each scan correponds to a single row that still stores all
values[^rownames]:
[^rownames]: The rownames aren't needed here and are removed to reduce
to output in the the next code chunks displaying parts of id2
.
id2 <- reducePSMs(id, id$spectrumID) rownames(id2) <- NULL ## rownames not needed here dim(id2)
The resulting object contains a single entrie for scan 1774 with information for the multiple matches stored as a list within the table cell.
j <- grep("scan=1774", id2$spectrumID) id2[j, ]
id2[j, "DatabaseAccess"]
The identification data could be used to annotate an raw mass
spectrometry Spectra
object (see the Spectra::joinSpectraData()
function for details).
Often, the PSM data is filtered to only retain reliable matches. The
MSnID
package can be used to set thresholds to attain user-defined
PSM, peptide or protein-level FDRs. Here, we will simply filter out
wrong or the least reliable identifications.
id <- filterPsmDecoy(id) id
id <- filterPsmRank(id) id
The data still contains shared peptides, i.e. those that different
proteins. For example QKTRCATRAFKANKGRAR
matches proteins ECA2869
and ECA4278
.
id[id$sequence == "QKTRCATRAFKANKGRAR", "DatabaseAccess"]
We can filter these out to retain unique peptides.
id <- filterPsmShared(id) id
This last filtering leaves us with r nrow(id)
PSMs.
Note that the ConnectedComponents
approach defined in this package
allows one to explore protein groups defined by such shared peptides
more thoroughly and make informed decision as to which shared peptides
to retain and which ones to drop. For details see the related vignette
or the
ConnectedComponents
manual page.
This can also be achieved with the filterPSMs()
function:
id <- PSM(f) filterPSMs(id)
mzR
and mzID
parsersThe PSM()
function can take two different values for the parser
parameter, namely "mzR"
(the default value) and "mzID"
.
mzR uses the openIDfile()
function from the
r BiocStyle::Biocpkg("mzR")
to parse the mzId
file(s), and then
coerces the data to a data.frame
which is eventually returned as
a PSM
object. The parser function uses dedicated code from the
Proteowizard project (included in mzR
) and is generally the
fastest approach.
mzID parses the mzId
file with mzID()
function from the
r BiocStyle::Biocpkg("mzID")
package, and then flattens the data to
a data.frame
with mzID::flatten()
and eventuelly returns a
PSM
object. The mzID
package relies on the r BiocStyle::CRANpkg("XML")
package. Is is slower but is is more robust to variations in the
mzID
implementation, as is thus a useful backup when the mzR
backend fails.
system.time(id1 <- PSM(f, parser = "mzR")) system.time(id2 <- PSM(f, parser = "mzID"))
Other differences in the two parsers include the columns that are
returned, the way they name them, and, as will shown below the matches
that are returned. Note for instance (and this will be important
later), that there is no equivalent of "modLocation"
in id2
.
names(id1) names(id2)
We also have different number of matches in the two tables:
nrow(id1) nrow(id2)
table(id1$isDecoy) table(id2$isdecoy)
Let's first filter the PSM tables to facilitate focus the comparison
of relevant scans. Note that the default filterPSMs()
arguments are
set to work with both parser.
id1_filtered <- filterPSMs(id1) id2_filtered <- filterPSMs(id2)
As can be seen, we are also left with r nrow(id1_filtered)
vs
r nrow(id2_filtered)
PSMs after filtering.
The difference doesn't stem from different scans, given that the spectum identifiers are identical in both tables:
identical(sort(unique(id1_filtered$spectrumID)), sort(unique(id2_filtered$spectrumid)))
The difference is obvious when we tally a table of spectrum id
occurences in the filtered tables. In id2_filtered
, each scan is
unique, i.e matched only once.
anyDuplicated(id2_filtered$spectrumid)
However, for id1_filtered
, we see that some scans are still repeat
up to 4 times in the table:
table(table(id1_filtered$spectrumID))
The example below shows that these differences stem from the
modification location ("modLocation"
), that is not report by the
mzID
parser:
k <- names(which(table(id1_filtered$spectrumID) == 4)) id1_filtered[id1_filtered$spectrumID == k, "sequence"] id1_filtered[id1_filtered$spectrumID == k, "modLocation"] id1_filtered[id1_filtered$spectrumID == k, "modName"]
If we remove the "modLocation"
column, we recoved the same number of
PSMs than with the mzID
parser.
id1_filtered$modLocation <- NULL nrow(unique(id1_filtered)) nrow(unique(id2_filtered))
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.