knitr::opts_chunk$set( #collapse = TRUE, comment = "#>", fig.width = 4, fig.height = 4, message = FALSE, warning = FALSE, tidy.opts = list( keep.blank.line = TRUE, width.cutoff = 150 ), options(width = 150), eval = TRUE )
\newpage
This vignette provides an overview of using the protlocassign
package for
constrained proportional assignment. A more detailed guide is available in a
companion set of seven tutorials in the package "protlocassigndoc";
details on installing that are given in the next section.
Additional details may be found in a companion paper
(DF Moore, D Sleat, and P Lobel, 2022).
Determining the locations of proteins in the cell is an important but complex problem. A frequently employed approach for this involves centrifugation-based methods to partially separate different organelles and other cellular compartments and then to determine the relative distribution of different proteins among the centrifugation fractions. The location of proteins of interest in the cellular compartments are then inferred by comparing their distributions across the centrifugation fractions to the distributions of a set of reference proteins (markers) with known cellular locations.
This package implements a subcellular protein assignment procedure known as “constrained proportional assignment”, or CPA (Jadot et al, 2017). The basic concepts involved in CPA and analysis of subcellular proteomics data are described in the main paper (Moore et al., 2022). This vignette serves as a basis for implementing CPA and various utilities for both experienced and beginner R users.
There are two fundamental inputs for CPA. The first is a file with information regarding the distribution of each protein or other species of interest to be analyzed across centrifugation fractions. Each row has a name that serves as a unique identifier, which we here refer to as a protein name, but one can use other identifiers (e.g., protein group, protein isoform, gene name, accession number, etc.). For each protein (or other identifier), there are data that reflect its distribution among the centrifugal fractions, which represent a profile. The second is a file containing a list of single-compartment reference proteins (markers) and their associated subcellular locations. As an initial step, the package uses the markers to compute profiles for individual cellular compartments. Then, for each protein, it finds the best match for its profile using a linear combination of compartment profiles. The relative weights of the linear combination in principle reflect the relative abundance of a given protein among different compartments. The CPA method can thus account for proteins that have multiple residences, and estimate the relative proportion among these residences.
We illustrate with an example. Tannous et al. (2020) presented an experiment (designated here as AT5) analyzing abundance levels of proteins across a total of nine fractions: six fractions (N, M, L1, L2, P, and S) from differential centrifugation of a rat liver homogenate and three fractions (Nyc1, Nyc2, and Nyc3) from a Nycodenz density gradient centrifugation of the differential fraction L1. Eight subcellular compartments were considered: nucleus (Nuc), mitochondria (Mito), lysosomes (Lyso), peroxisomes (Perox), endoplasmic reticulum (ER), Golgi apparatus (Golgi), plasma membrane (PM), and cytosol (Cyto). The CPA method assigns each protein to one or more of these compartments, based on profiles from the set of reference proteins.
Other Bioconductor packages are available for working with proteomics data. The "HPAanalyze" package (Tran et al., 2019) provides tools for accessing the Human Protein Atlas and for visualizing the data. The "pRoloc" package (Breckels et al.) provides tools for studying localisation of proteins inside cells and for manipulating and visualizing relevant data from quantitative fractionation experiments. In this vignette and in the tutorials in the accompanying "protlocassigndoc" package we provide detailed instructions on how to format data in ".csv" files for import in this package. Data stored in "pRoloc" or "hpa" R formats may be converted into the formats required by "protlocassign" in a similar fashion.
First install the Bioconductor utility BiocManager
from CRAN:
install.packages("BiocManager")
Next, install the protlocassign
package from Bioconductor:
BiocManager::install("protlocassign")
Alternatively one may install a development version directly from Github as follows:
install.packages("devtools") devtools::install_github("mooredf22/protlocassign")
An extensive set of companion supplementary tutorials may be installed directly from github:
devtools::install_github("mooredf22/protlocassigndoc")
Finally, install other required packages:
BiocManager::install(c("BB", "pracma", "lme4", "outliers", "BiocParallel"))
Once you have installed these packages, you do not need to re-install
them the next time you start up R. However, you will need to load them
using the library
function before use.
To use this package, you will need two data sets as described above. One, the protein profile data set, contains rows with a unique identifier followed by data describing the profile associated with each identifier across a series of centrifugation fractions. In this tutorial, the profiles associated with each identifier are specific amounts with sums constrained to 1 (NSA, see Moore et al., 2002 and Tutorial 3) but can be in a different form or further transformed to improve the quality of the fit (see Tutorials 2-4). The protein profile data set may contain two additional values representing the numbers of peptides and spectra that were used to compute the profile (see below). The other data set consists of the list of reference proteins and their associated known subcellular compartments.
Consider for example a test data set drawn from the TMT MS2 data from Tannous et al. (2020) experiment AT5. The full data set has 7894 proteins
but we use a small subset of these for purposes of illustration and to reduce processing time. To get started, load the package and attach the embedded test protein profile data set; we see that it has 40 rows and 11 columns:
library(protlocassign) data(protNSA_test) dim(protNSA_test)
For the sake of brevity, we rename it protNSA
:
protNSA <- protNSA_test
The first few rows of the data can be examined using the head
command,
rounding to improve legibility. The first nine columns represent the
protein profile. The last two columns, which are optional, give the
number of spectra and the number of peptides (sequences) for each protein.
Note that while we use a nested random effects model described in
Jadot et al., 2017 to compute the means across spectra
(also see Tutorial 6), other methods, including taking a straight average
or weighted average (e.g., based on reporter ion intensities or peak areas),
may be appropriate for other applications.
round(head(protNSA), digits=2)
We can also use the “str” command as an alternate way to examine the structure of the protNSA data set. The first line returned indicates that protNSA is represented in R as a data frame with 40 rows and 11 variables. The remaining lines show each variable labeled according to its name (column name) followed by its type (here “num”) and the first few values, omitting row names that are present within the data frame.
str(protNSA)
The list containing reference proteins is derived from Jadot et al (2017) and examining the dimensions of the data reveals that it contains 37 rows and two columns, the first for the protein names and the second for their respective subcellular compartments.
data(markerListJadot) dim(markerListJadot)
The data set can be viewed by entering its name as follows:
markerListJadot
While the protNSA_test
and markerListJadot
data sets are included
in the package, they were initially read into R from external files and
automatically converted to data frames. This will need to be done for
any new data set. As an example, we demonstrate this for a case where one
is working with the Windows operating system and the data sets reside in
the directory C:\temp\myproteindata
.
(If one is working with either the Linux or Mac OS, appropriate changes
to these procedures will be needed to access files.)
One can set the working directory to point there, either by navigating
to it in R studio (session menu pane) or by entering:
setwd("C:\\temp\\myproteindata")
Here, for for illustration, we shall instead create a temporary directory as follows:
currentDir <- tempdir() setwd(currentDir) currentDir
To illustrate how to read in the protein profiles
and list of marker proteins, we first write out
our protein profiles and marker list as comma-delimited files.
For the protein profiles, we write out the protein names
(which are row names in R) by specifying row.names=TRUE
.
Note that we first alter the protein names so that they are
preceded by a single quote since, if one subsequently opens
the file with Microsoft Excel, it could automatically reformat some names
to dates (e.g., March1 to 1-Mar).
To write protein profiles and marker proteins to .csv
files, input:
setwd(currentDir) protNSAout <- protNSA rownames(protNSAout) <- paste("'", rownames(protNSA), sep="") markerListJadotOut <- markerListJadot markerListJadotOut$protName <- paste("'", markerListJadot$protName, sep="") write.csv(protNSAout, file="protNSAout.csv", row.names=TRUE) write.csv(markerListJadotOut, file="markerListJadotOut.csv", row.names=FALSE)
We may examine these files by importing them into Excel. For the marker protein data file, the first row must contain the two column names (protein unique identifier and compartment designation). Note that the marker proteins must be specified precisely as listed in the protein profile data set.
We then read in the two data sets
(protein profile data and reference protein list),
which must be in .csv
format, with the first row containing column names.
The option row.names=1
takes the first column of
the protNSAout.csv
file and uses it as row names
for the R file MyProtNSAin
.
setwd(currentDir) MyProtNSAin <- read.csv(file="protNSAout.csv", row.names=1) MyMarkerListIn <- read.csv(file="markerListJadotOut.csv")
We then remove the preceding single quotes from the protein names.
rownames(MyProtNSAin) <- sub("^'", "", rownames(MyProtNSAin)) MyMarkerListIn[,1] <- sub("^'", "", MyMarkerListIn[,1])
We man examine the files as follows:
round(head(MyProtNSAin), digits=3) head(MyMarkerListIn)
These new R data frames are identical to
protNSA_test
and markerListJadot
. If the user reads in their own data,
be sure that the first row in each of the .csv
files contains column names.
In particular, the names in the first row of the markers file must be
"protName" and "referenceCompartment" to ensure that the resulting
R data frame is in the proper format.
If one desires to work with the full set of 7894 proteins in the Tannous et al. data set, it may be downloaded from the MassIVE repository. To do so, execute the following code:
urlProtProfiles <- "https://massive.ucsd.edu/ProteoSAFe/DownloadResultFile?file=f.MSV000083842/updates/2022-03-14_ablatannous_c39637a3/quant/protNSA_AT5tmtMS2.csv&forceDownload=true" protNSA_AT5tmtMS2 <- read.csv(urlProtProfiles) dim(protNSA_AT5tmtMS2) head(protNSA_AT5tmtMS2)
To run the examples in this vignette using the full data set
protNSA_AT5tmtMS2
in place of the test data set, one may
rename the full data set as follows:
protNSA <- protNSA_AT5tmtMS2
In order to assign proteins proportionately to their respective
compartments, we first use the function locationProfileSetup
to obtain profiles for the compartments based on the means of the
individual reference proteins that represent each compartment.
This function produces a matrix that has one row for each compartment
and one column for each fraction that comprises the profile.
refLocationProfilesNSA <- locationProfileSetup(profile=protNSA, markerList=markerListJadot, numDataCols=9)
We display the data, using rounding to improve readability.
round(refLocationProfilesNSA, digits=4)
To examine the available compartments, view the row names of
refLocationProfilesNSA
.
rownames(refLocationProfilesNSA)
To graphically display a particular compartment profile and its
component proteins, use markerProfilePlot
.
For example, to plot the profile for plasma membrane, input:
markerProfilePlot(refLoc="PM", profile=protNSA, markerList=markerListJadot, refLocationProfiles=refLocationProfilesNSA, ylab="NSA")
This displays the PM compartmental profile (dashed yellow-black line)
and its five component reference proteins (red lines).
To plot, for example, the second PM marker protein,
use the option refProtPlot=2
in the markerProfilePlot
function:
markerProfilePlot(refLoc="PM", profile=protNSA, markerList=markerListJadot, refLocationProfiles=refLocationProfilesNSA, ylab="NSA", refProtPlot=2)
To show individually each of the five reference proteins that are used
to calculate the PM compartmental profile, use par(mfrow=c(3,2))
to set up a plot array with 3 rows and 2 columns which will
accommodate up to six plots on a single page. Then plot all
five of the them one-by-one by looping through the five PM
marker proteins by first specifying for (j in 1:5)
and
invoking the option refProtPlot=j
in the markerProfilePlot
function.
(To make them legible, you may first need to open a
new window by entering windows(width=5, height=7)
.)
The parameters in par(mfrow=c(3,2))
and for (j in 1:5)
can be adjusted to display any number of plots.
par(mfrow=c(3,2)) # this will be new default layout for subsequent plots. # Will need to reset par(mfrow=c(1,1)) for single graph layouts for (j in 1:5) { markerProfilePlot(refLoc="PM", profile=protNSA, markerList=markerListJadot, refLocationProfiles=refLocationProfilesNSA, ylab="NSA", refProtPlot=j) }
To plot all of the compartment and individual marker protein profiles, we may set up a plot window with four rows and two columns and then loop through the eight subcellular compartments:
loc.list <- rownames(refLocationProfilesNSA) n.loc <- length(loc.list) par(mfrow=c(4,2)) for (i in 1:n.loc) { markerProfilePlot(refLoc=loc.list[i], profile=protNSA, markerList=markerListJadot, refLocationProfiles=refLocationProfilesNSA, ylab="NSA") }
Once we are satisfied with the marker set and compartment profiles,
we can run the CPA routine which, for each protein in our data set,
apportions its residency among the different specified compartments.
Note that the command below uses default options. It is also possible
to specify optional parameters including writing out the obtained
goodness of fit minimum (minVal=TRUE
, default is minVal=FALSE
),
setting initial parameters for each CPA value using an eight element vector
specifying the starting proportions
(startProp
, if not specified the function will assign a value of 1/8 for
each of the eight compartments), and by constraining the output CPA values
for specified compartments to be zero (see help file, ?fitCPA
).
The optimization procedure applies a projection onto a subspace in order
to enforce the constraint that the estimated proportions add up to one.
(Chen and Ye, 2011).
protCPAfromNSA <- fitCPA(profile=protNSA, refLocationProfiles=refLocationProfilesNSA, numDataCols=9)
We can view the structure of the output data file and see that the
rows contain the protein names, and the next eight columns contain
the allocation proportions of each protein to the eight compartments.
The last two columns are integers representing the numbers of peptides
and spectra assigned to each respective protein; these are carried over
from the input protProfileSummary
profile data and are not essential
for the CPA analysis.
round(head(protCPAfromNSA), digits=2)
Note that the protein "AIF1" has all missing values, which is why an error is reported for that one protein. While AIF1 is the only one of the 7894 proteins in the full AT5tmtMS2 data set that has no profile, we include it in our 40 protein test data set to show that such cases can occur.
We can look at the profile of, for example, the protein "TLN1". We first ensure that the protein is in the dataset:
protIndex("TLN1", profile=protNSA)
This function also accepts partial matching of the first few letters of a protein. For example, we can find the indices of the proteins starting with "TLN":
protIndex("TLN", profile=protNSA)
Now we plot the results for protein TLN1:
protPlotfun(protName="TLN1", profile=protNSA, numDataCols=9, n.compartments=8, refLocationProfiles=refLocationProfilesNSA, assignPropsMat=protCPAfromNSA, yAxisLabel="Normalized Specific Amount")
The x-axis represents the nine fractions, which are N, M, L1, L2, P, S, Nyc.1, Nyc.2, and Nyc.3. In each of the eight plots, the red line is the average profile of the protein. The dashed yellow-black lines show the expected profile for a protein entirely resident in the respective subcellular location. In this set of plots, we see that the CPA procedure assigns a 57 percent residence proportion to plasma membrane and 36 percent residence to cytosol. The observed red profile would closely resemble a mixture of the yellow-black lines weighted by the indicated proportions.
The protPlotfun
function is designed to plot profiles of eight
subcellular locations. If a data set has more than eight of these,
it will be necessary to modify the code to accommodate the larger number.
To save the results of the CPA procedure, as before,
we prepend the protein names
with a single quote and write out a .csv
file to a specified directory.
protCPAfromNSAout <- protCPAfromNSA rownames(protCPAfromNSAout) <- paste("'", rownames(protCPAfromNSAout), sep="") setwd("C:/temp/myProteinOutput") write.csv(protCPAfromNSAout, file="protCPAfromNSAout_AT5tmtMS2.csv", row.names=TRUE)
To save the plot of a protein (TLN1
for example) as a pdf file,
we first specify a pdf plot window
, call the protPlotfun
function as before, and then close the plot window
using dev.off()
to allow R to complete producing the file:
pdf(file="myPlotPDFfile.pdf", width=7, height=10) protPlotfun(protName="TLN1", profile=protNSA, numDataCols=9, n.compartments=8, refLocationProfiles=refLocationProfilesNSA, assignPropsMat=protCPAfromNSA, yAxisLabel="Normalized Specific Amount") dev.off()
To output plots all of the protein profiles into a single pdf file, one can set up a loop as follows:
pdf(file="allPlotsPDFfile.pdf", width=7, height=10) n.prots <- nrow(protCPAfromNSA) for (i in 1:n.prots) { protPlotfun(protName=rownames(protCPAfromNSA)[i], profile=protNSA, numDataCols=9, n.compartments=8, refLocationProfiles=refLocationProfilesNSA, assignPropsMat=protCPAfromNSA, yAxisLabel="Normalized Specific Amount") } dev.off()
This will result in a single file, allPlotsPDFfile.pdf
,
with a page for each protein plotted. Note that one protein
in our data set, AIF1
, does not have a profile as all peptides
were outliers (see Tutorial 6).
\newpage
In Section 1 (“Getting Started”), we illustrated the principles of
CPA using protein profiles that represent mass spectrometry data
from a set of subcellular fractions in the form of normalized
specific amounts (NSA). NSA profiles have equivalent amounts of
total protein analyzed per fraction, with the sum of all
fractions constrained to 1. This section describes how
to use functions in the protlocassign
package to transform
NSA profiles in ways that may provide more accurate assignments
of proportional residence to subcellular compartments.
As explained in the Section 1, we will demonstrate using
two R data sets that are included in the protlocassign
package.
One, protNSA_test
, consists of row names that indicate protein
identifiers, each of which is followed by data describing the
identifier profile across the nine normalized specific amounts
derived from a subcellular fractionation experiment.
The other data set, markerListJadot
, consists of a list of
reference proteins and their associated known subcellular compartments.
As before, to run the program, the protlocassign
library must be installed.
library(protlocassign)
In Tutorial 1, we use protNSA_test
and an untransformed average of
reference protein profiles in the form of NSA for each compartment to
conduct CPA. However, it may be advantageous to transform the data prior
to conducting CPA to yield a more accurate prediction of cellular location.
For this purpose, we express profile data as RSAs.
As explained in the main text and elaborated in Tutorial 3,
RSA is the ratio of two ratios: the numerator is the amount of a
given protein in a particular fraction divided by the amount of that
given protein in the starting material while the denominator is amount
of total protein in a particular fraction divided by the amount of
total protein in the starting material. The RSA describes the
fold-enrichment (RSA>1) or depletion (RSA<1) of a protein during
the fractionation process, and is analogous to the relative specific
activity term used in classical analytical subcellular fractionation.
Be aware that to perform this transformation, one needs to have
estimates of all these quantities, and this was incorporated
into our experimental design. In our example, the first six fractions
(the differential fractions) can be used to estimate amounts in the
starting material. We also measured total protein in each fraction,
and these are contained in the 9-element vector totProtAT5
which is preloaded in protlocassign
. Note that the order
and numbers of the measurements for total protein
(e.g., N, M, L1, L2, P, S, Nyc1, Nyc2 and Nyc3 in totProtAT5
)
must correspond to those in the data set containing individual
protein profiles (e.g., protNSA_test
). For clarity of presentation,
we rename totProtAT5
and protNSA_test
to
totProt
and protNSA
, respectively.
data(protNSA_test) protNSA <- protNSA_test head(round(protNSA, digits=3)) data(totProtAT5) totProt <- totProtAT5 round(totProt, digits=4)
The function RSAfromNSA
calculates transformed profiles from
individual and total protein measurements. This requires
specifying which values are used to estimate the amount in the starting
material (typically the homogenate) and the values used to
construct the profile. In our case, the first six fractions
of the nine-fraction profile are summed to estimate the starting
material. Our code requires that the fractions representing the
starting material are contiguous and are located at the beginning
of the profile. Note that the function RSAfromNSA
can use protein
profiles expressed either as NSA or as specific amounts
(see Tutorial 3).
Thus we select the first nine columns of protNSA
:
protRSA <- RSAfromNSA(NSA=protNSA[,1:9], NstartMaterialFractions=6, totProt=totProt) dim(protRSA) round(head(protRSA), digits=3) str(protRSA)
Since there is additional information in the last two columns of
protNSA
that we want to include in the new file, specifically the
numbers of spectra and peptides (Nspectra
and Nseq
),
we add them to the output as follows:
protRSA <- data.frame(protRSA, protNSA[,10:11]) #note data frame is being overwritten dim(protRSA) str(protRSA)
We also need to transform the profiles of the markers for each compartment.
As in Tutorial 1, we use the function locationProfilesetup
to
average the profiles (which must be normalized specific amounts)
to obtain profiles for the reference proteins:
data(markerListJadot) refLocationProfilesNSA <- locationProfileSetup(profile=protNSA, markerList=markerListJadot, numDataCols=9) round(refLocationProfilesNSA, digits=4)
We then use RSAfromNSA
to transform these reference profiles.
refLocationProfilesRSA <- RSAfromNSA(NSA=refLocationProfilesNSA, NstartMaterialFractions=6, totProt=totProt) round(refLocationProfilesRSA, digits=4)
We computed the RSA reference profiles above from the NSA reference profiles.
Note that, as an alternative, one could compute the RSA reference profiles
directly from protRSA
. These two approaches yield similar but
non-identical results, and we typically use the first procedure to
generate RSA-transformed reference location profiles.
refLocationProfilesRSA_2 <- locationProfileSetup(profile=protRSA, markerList=markerListJadot, numDataCols=9) round(refLocationProfilesRSA_2, digits=4) # we use the `as.matrix` function for display purposes in the tutorial as.matrix(all.equal(refLocationProfilesRSA, refLocationProfilesRSA_2, precision=0, countEQ=TRUE))
As in Tutorial 1, we can plot reference profiles, but this time using the RSA-transformed data. For example, here is a plot for all markers:
loc.list <- rownames(refLocationProfilesRSA) n.loc <- length(loc.list) par(mfrow=c(4,2)) for (i in 1:n.loc) { markerProfilePlot(refLoc=loc.list[i], profile=protRSA, markerList=markerListJadot, refLocationProfiles=refLocationProfilesRSA, ylab="RSA") }
Now we can run the CPA routine on the RSA-transformed levels. The result is a matrix with protein identifiers as row names and data indicating the estimated proportional assignments of each protein among the eight subcellular locations.
protCPAfromRSA <- fitCPA(profile=protRSA, refLocationProfiles=refLocationProfilesRSA, numDataCols=9)
Note that an error is reported; this is from the attempted fit of the protein “AIF1” which doesn’t have a profile (see Tutorial 1). We examine the output data set as follows:
round(head(protCPAfromRSA), digits=3)
The protPlotfun
function is designed to plot profiles of
eight subcellular locations. If a data set has more than eight of these,
it will be necessary to modify the code to accommodate the larger number.
Now we plot the results for protein TLN1:
protPlotfun(protName="TLN1", profile=protRSA, numDataCols=9, refLocationProfiles=refLocationProfilesRSA, assignPropsMat=protCPAfromRSA, yAxisLabel="Relative Specific Amount")
The x-axis represents the nine fractions, which are N, M, L1, L2, P, S, Nyc.1, Nyc.2, and Nyc.3. In each of the eight plots, the red line is the average profile of the protein and the dashed yellow-black lines show the expected profile for a protein entirely resident in the respective subcellular location. We see that the CPA procedure assigns a 35 percent residence proportion to plasma membrane and 53 percent residence to cytosol. As in Tutorial 1, the observed red profile is a weighted mixture of the expected yellow-black lines.
\newpage
As explained in the main text and in Section 2, there are different
ways to transform protein profile data. Here, we describe a way to
explore the effect of using transformed data with CPA.
Briefly, we conduct different data transformations on a set of
theoretical proteins that have a range of distributions between
two cellular compartments, and then conduct CPA and determine
how well it predicts the original distribution.
To create the theoretical proteins for our simulations, we use
data from the experiment from Tannous et al. which consists of
a TMT-MS2 analysis of six differential fractions (N, M, L1, L2, P, and S)
obtained from centrifugation of a rat liver homogenate and three fractions
from a Nycodenz density gradient separation of the differential
fraction L1 (Nyc1, Nyc2, and Nyc3).
In our procedure, we first use the eight compartment profiles generated from the marker protein set to simulate a set of eight theoretical proteins that wholly reside in each of the respective compartments. We then then create binary mixtures with defined combinations of the eight theoretical proteins to simulate proteins that are distributed in varying proportions between two compartments. Note that data from mass spectrometry experiments generally represent specific amounts ${s_{\alpha ,l}}$ of a protein $\alpha$ in fraction $l$, with the same amount of total protein being analyzed for each sample (fraction). For conducting our simulations, we first must transform this data into relative amounts, so that each protein has precisely the same total amount in the initial starting material used for fractionation.
Consider an $n$ by 9 matrix of specific amounts that details the average distribution of $n$ proteins among 9 fractions: $$ \mathbf{S}=\left[ \begin{matrix} {{s}{11}} & {{s}{12}} & \cdots & {{s}{19}} \ {{s}{21}} & {{s}{22}} & \cdots & {{s}{29}} \ \vdots & \vdots & {} & \vdots \ {{s}{n1}} & {{s}{n2}} & \cdots & {{s}_{n9}} \ \end{matrix} \right] $$
Each row $\alpha$ represents a mean profile for a protein $\alpha$, with each profile consisting of $f=9$ fractions.
In Section 1 we used a normalized specific amount (NSA), denoted here as $\tilde{s}_{\alpha,l}$ calculated as follows:
$$ {{\tilde{s}}{\alpha ,l}}=\frac{{{s}{\alpha ,l}}}{\sum\limits_{j=1}^{f}{{{s}_{\alpha ,j}}}} $$ In some experiments, these are the only values available for using the CPA procedure. However, with appropriate experimental design and execution, using balance sheet analysis (“bookkeeping”), one can estimate the amounts of total protein present in different samples obtained from a set amount of starting material. These include the total protein content of the starting material (designated $t_h$) and the total protein content of any given fraction (designated $t_l$ for fraction $l$). We can use these to calculate the amount of protein in arbitrary units in fraction $l$ derived from a set amount of starting material as $a_l = s_l t_l$. Note that as $a_l$ is in arbitrary units, one can do the same calculation using either $s_l$ or ${\tilde s_l}$, as these will yield the same values when calculating relative amounts (Acup) (see below).
To see how this works in protlocassign
, let us consider the Jadot et al.
reference protein profiles, which we generate using the
locationProfileSetup
function. As in earlier sections,
we rename protNSA_test
as protNSA
and we tailor
the output using the round
function.
library(protlocassign) data(protNSA_test) data(markerListJadot) protNSA <- protNSA_test refLocationProfilesNSA <- locationProfileSetup(profile=protNSA, markerList=markerListJadot, numDataCols=9) round(refLocationProfilesNSA, digits=3)
Here, each row represents the profile $\tilde{s}_l$ for a protein
resident solely in a particular cellular compartment.
The amount of total protein derived from a set amount
of starting material from all fractions in the experiment used
for this vignette is in totProtAT5
, a vector supplied with the
protassign
package. For convenience we rename it totProt
.
data(totProtAT5) totProt <- totProtAT5 round(totProt, digits=3)
We denote these values using a vector $\mathbf{t}= ({t_1},{t_2}, \ldots ,{t_9})$.
As noted above, for protein $\alpha$ in fraction $l$, we have ${{a}{\alpha ,l}}={{\tilde{s}}{\alpha ,l}}{{t}_{l}}$. In matrix form,
$\mathbf{A}=\widetilde{\mathbf{S}}\cdot\mathrm{diag}(\mathbf{t})=\left[\begin{matrix}{\widetilde{s}}{11}t_1&{\widetilde{s}}{12}t_2&\cdots&{\widetilde{s}}{19}t_9\{\widetilde{s}}{21}t_1&{\widetilde{s}}{22}t_2&\cdots&{\widetilde{s}}{29}t_9\\vdots&\vdots&&\vdots\{\widetilde{s}}{n1}t_1&{\widetilde{s}}{n2}t_2&\cdots&{\widetilde{s}}_{n9}t_9\\end{matrix}\right]$
We need this matrix to do the next step, which is to convert to a common scale for all proteins. We do this by normalizing to the amount of protein $\alpha$ in the starting material, which we denote as $a_{\alpha, h}$. If the homogenate is measured directly, this can be calculated from ${\widetilde{s}{\alpha ,h}}{{t}{h}}$. Alternatively, if a complete set of fractions that entirely represent the homogenate are available, it is preferable to calculate this by summing $\widetilde{s}_{\alpha,l}t_l$ over these fractions. In our case, the first six fractions (N, M, L1, L2, P, and S) are a complete set of differential fractions that represent the starting material, and thus:
$${a_{\alpha ,h}} = \sum\nolimits_{i = 1}^6 {{\tilde{s}{\alpha ,i}}{t_i}} $$ and $t_h=\sum{i=1}^{6}t_i$. Note that this is readily calculated by summing the first six elements of $\mathbf{t}$.
sum(totProt[1:6])
(The last three "Nyc" columns were derived from L1, so we do not include them in the sum.)
Finally, we normalize amounts in any given fraction to amounts
in starting material, which we call the relative amount,
designated here as Acup
, and denote as ${\breve{a}}_{\alpha,l}$
In our example,
$${\overset{\smile}{a}{\alpha ,l}} =\frac{a{\alpha,l}}{a_{\alpha,h}} = \frac{{{s_{\alpha ,l}}{t_l}}}{{{s_{\alpha ,h}}{t_h}}} = \frac{{{s_{\alpha ,l}}{t_l}}}{{\sum\nolimits_{i=1}^6 {{s_{\alpha ,i}}{t_i}} }} = \frac{{{{\tilde s}{\alpha ,l}}{t_l}}}{{\sum\nolimits{i = 1}^6 {{{\tilde s}_{\alpha ,i}}{t_i}} }}$$
In matrix form, we may write this as:
$\breve{\mathbf{A}}=\mathbf{A}\cdot\mathrm{diag}(1/a_{{\alpha},h})=\left[\begin{matrix}{ \tilde{s}}{11}t_1/a{1,h}&{ \tilde{s}}{12}t_2/a{1,h}&\cdots&{ \tilde{s}}{19}t_9/a{1,h}\{ \tilde{s}}{21}t_1/a{2,h}&{ \tilde{s}}{22}t_2/a{2,h}&\cdots&{ \tilde{s}}{29}t_9/a{2,h}\\vdots&\vdots&&\vdots\{ \tilde{s}}{n1}t_1/a{n,h}&{ \tilde{s}}{n2}t_2/a{n,h}&\cdots&{ \tilde{s}}{n9}t_9/a{n,h}\\end{matrix}\right]$
or simply as ${\breve{\mathbf{A}}} = [\breve{a}_{\alpha ,l}]$.
The function AcupFromNSA
computes this:
refLocationProfilesAcup <- AcupFromNSA(NSA=refLocationProfilesNSA, NstartMaterialFractions=6, totProt=totProt) round(refLocationProfilesAcup, digits=4)
The values in refLocationProfilesAcup
represent,
in principle, the relative amount (Acup) of each cellular compartment
distributed to each centrifugation fraction.
When examining the distribution of a given protein in different fractions, it is particularly useful to consider its abundance relative to that of total protein. We refer to this as a RSA or $r$, which can be calculated as ${r_{\alpha ,l}} = \breve{a}{\alpha ,l}/ \breve{t}_l$, where the vector $\breve{\mathbf{t}}=\frac{\mathbf{t}}{t_h}=\frac{\mathbf{t}}{\sum{1}^{6}t_i}$. For a protein $\alpha$ the RSA in fraction $l$ is given by:
$r_{\alpha,l}=\frac{{\breve{a}}{\alpha,l}}{{\breve{t}}_l}=\frac{\frac{{\widetilde{s}}{\alpha,l}t_l}{\sum_{i=1}^{6}{{\widetilde{s}}{\alpha,i}t_i}}}{\frac{t_l}{\sum{i=1}^{6}t_i}}=\frac{s_{\alpha,l}\cdot\sum_{i=1}^{6}t_i}{\sum_{i=1}^{6}s_{\alpha,i}t_i}=\frac{{\widetilde{s}}{\alpha,l}\cdot\sum{i=1}^{6}t_i}{\sum_{i=1}^{6}{\widetilde{s}}_{\alpha,i}t_i}$
In matrix form, this is $\mathbf{R}=\left[ {{r}_{\alpha ,l}} \right]$.
We can get the RSA matrix from refLocationProfilesAcup
and
totProt
as follows:
refLocationProfilesRSA <- RSAfromAcup(refLocationProfilesAcup, NstartMaterialFractions=6, totProt=totProt) round(refLocationProfilesRSA, digits=3)
Note that we can also obtain the RSA transformed data directly from NSA data as described in Section 2:
refLocationProfilesRSA_2 <- RSAfromNSA(NSA=refLocationProfilesNSA, NstartMaterialFractions=6, totProt=totProt)
This yields precisely the same values as calculated previously
using locationProfileSetup
.
identical(refLocationProfilesRSA, refLocationProfilesRSA_2)
Finally, if we normalize an RSA profile so that the rows sum to one,
this yields a normalized specific amount profile (see Appendix of
main paper, Moore et al., 2022). Consider, for example, the matrix
refLocationProfilesRSA
, which contains the RSA-transformed compartment
profiles. We normalize the rows using the apply
function, and then
transpose using the t
function to yield a matrix of the normalized
specific amounts; these values are essentially identical to those that
we started with in refLocationProfiles
.
This is performed using the function NSAfromRSA
:
refLocationProfilesNSA_2 <- NSAfromRSA(refLocationProfilesRSA_2)
Note that refLocationProfilesNSA_2
is not identical to the values
obtained previously using the locationProfileSetup
function with
the protein profiles containing NSA data because of internal precision
issues, but both are essentially equivalent.
as.matrix(all.equal(refLocationProfilesNSA_2, refLocationProfilesNSA, tolerance=0, countEQ=TRUE))
The available transformation functions are AcupFromNSA
,
RSAfromAcup
, RSAfromNSA
(a combination of the previous two),
and NSAfromRSA
. For completeness, we also include the functions
NSAfromAcup
and AcupFromRSA
. All functions except NSAfromRSA
require arguments for NstartMaterialFractions
and totProt
.
These functions allow any profile to be expressed in the form of
NSA
, Acup
, and RSA
.
We may simulate data from proteins with multiple residences using
the proteinMix
function. For example, to simulate data from
proteins that are distributed in a range of proportions between
cytosol and lysosomes, we use this function with relative amounts
(Acup) of their single-compartment profiles.
Note that we need to use Acup-transformed data to create the mixtures
so that the total amount of the given protein summed across all
fractions will be invariant for all mixtures. By default,
we vary the proportions by increments of 0.1, but different values
can be specified using the argument increment.prop
(e.g., increment.prop=0.2
or increment.prop=0.05
).
We specify the mixing locations by the arguments
Loc1=1
and Loc2=4
, which are the row numbers of the desired
compartment profiles in refLocationProfiles
(i.e. Cyto and Lyso):
refLocationProfilesAcup <- AcupFromNSA(NSA=refLocationProfilesNSA, NstartMaterialFractions=6, totProt=totProt) data.frame(rownames(refLocationProfilesAcup)) mixCytoLysoAcup <- proteinMix(AcupRef=refLocationProfilesAcup, increment.prop=0.1, Loc1=1, Loc2=4) # Note that the default value of increment.prop=0.1. # This does not need to be explicitly # specified unless a different increment is desired.
The result is a matrix that contains the Acup values (relative amounts) for the simulated proteins:
round(mixCytoLysoAcup, digits=3)
Then we can test the CPA algorithm by first converting this mixture data to RSA:
mixCytoLysoRSA <- RSAfromAcup(Acup=mixCytoLysoAcup, NstartMaterialFractions=6, totProt=totProt) round(mixCytoLysoRSA, digits=3)
Finally, we fit the CPA algorithm to this RSA-transformed, simulated data using the previously generated RSA-transformed marker protein profiles:
mixCytoLysoCPAfromRSA <- fitCPA(profile=mixCytoLysoRSA, refLocationProfiles=refLocationProfilesRSA, numDataCols=9) round(mixCytoLysoCPAfromRSA, digits=3)
The estimated proportions correspond closely to the proportions used in the simulation (see below).
Plotting these estimates against the “true” distributions of the simulated proteins provides insights into the effects of different transformations on the goodness of fit. As an introduction, we first illustrate this using plots of simulated proteins distributed in varying amounts between the cytoplasm and the lysosome with CPA conducted on the RSA-transformed data. We then extend this to simulated proteins distributed between the cytoplasm and each of the other compartments. In Tutorial 4 in the supplemental documentation we use the simulated mixtures to evaluate the effect of various transformations on the CPA procedure.
We can use the mixturePlot
function to evaluate the effect of
different data transformations used to represent protein profiles
on the compartmental distributions estimated by CPA.
The function produces graphical representations by plotting
the theoretical distribution (based on simulation parameters, x-coordinate)
versus the predicted values (based on CPA, y-coordinate).
This function also evaluates the prediction error by computing the
area separating the predicted and expected protein mixtures via
the trapezoidal rule; this is done with the trapz
function in the
pracma
library, which must have been previously installed.
We also need to tell the program which two locations were used
to generate the mixtures using Loc1
and Loc2
.
As our first example, we examine the results CPA using the RSA
transformation on simulated proteins distributed between Cyto and Lyso.
library(pracma) par(mfrow=c(1,1)) # reset window for a single plot # The argument increment.prop needs to match the value used # in creating the mixture using proteinMix. This does # not need to be specified if using the value of 0.1 mixturePlot(mixProtiProtjCPA=mixCytoLysoCPAfromRSA, NstartMaterialFractions=6, Loc1=1, Loc2=4, increment.prop=0.1, xaxisLab=TRUE, yaxisLab=TRUE)
Here we see visually that the estimated proportions match the simulated ones. The prediction error, i.e., the area separated by the observed and expected CPA estimates, is nearly zero, and is shown in parentheses.
Next, we consider a mixture of Cyto with each of the other seven
compartments using RSA-based transformations. We begin by setting
up the plot area for a 4 by 2 array of plots. Optionally, to
control the size of the window, we may want to explicitly
open a window using windows(height=10, width=7)
.
Next, we fix one component of the mixture to the first,
which is Cyto, and loop over the other 7 subcellular compartments,
creating mixtures of the refLocationProfilesAcup
values.
For each case, we transform these mixtures to RSA and obtain
CPA mixing proportion estimates from these RSA-transformed
mixtures and compute the area-based prediction errors.
These values are stored in a data frame mixErrorMat
which is
then renamed to avoid overwriting since multiple
mixtures and transformations are being explored.
par(mfrow=c(4,2)) i <- 1 mixErrorMat <- NULL for (j in 2:8) { # Create the mixture of Cyto (i = 1) with compartment j mixProtiProtjAcup <- proteinMix(Acup=refLocationProfilesAcup, Loc1=i, Loc2=j) # Tranform the mixtures to RSAs mixProtiProtjRSA <- RSAfromAcup(Acup=mixProtiProtjAcup, NstartMaterialFractions=6, totProt=totProt) # Find the CPAs mixProtiProtjCPAfromRSA <- fitCPA(profile=mixProtiProtjRSA, refLocationProfiles=refLocationProfilesRSA, numDataCols=9) # Plot the results, including the area-based error estimate, # and collect the area-based errors (errorReturn=TRUE) mixResult <- mixturePlot(mixProtiProtjCPA=mixProtiProtjCPAfromRSA, NstartMaterialFractions=6, Loc1=i, Loc2=j, increment.prop=0.1, errorReturn = TRUE) mixErrorMat <- rbind(mixErrorMat, mixResult) } mixErrorAllCytoRSA <- mixErrorMat
All seven mixtures have essentially zero area-based error, as we expect. We can examine these errors with more precision as follows:
mixErrorAllCytoRSA
\newpage
The Bioconductor "pRoloc" package provides methods to analyze organelle proteomics data and the "MSnbase" package, also from Bioconductor, provides a flexible structure to store this data. In this section we show how to convert the rectanular proteomics data structures of "protlocassign" into "MSnbase" data structures on which "pRoloc" functions may operate.
We illustrate the interface using the "protRSAtest" data discussed above, which we renamed "protRSA".
First, attach the necessary libraries and extract the protein names and data:
library(pRoloc) library(MSnbase) ProteinID <- as.factor(row.names(protRSA[-2,])) protRSAdata <- protRSA[-2,1:9] rownames(protRSAdata) <- 1:nrow(protRSAdata)
Next create an expression data frame containing protein names and RSA abundance levels:
exprsProt <- data.frame(ProteinID, protRSAdata) names(exprsProt) <- as.factor(names(exprsProt))
To create a file of metadata, we first need a list of organelle locations, and then create assignments as follows:
Locations <- c(unique(markerListJadot$referenceCompartment), "unclassified") assign0p8 <- apply(protCPAfromRSA[-2,1:8], 1, assignCPAloc, cutoff=0.8, Locations=Locations)
Finally we create the metadata file with organelle assignments and spectral and peptide counts:
fdataProt <- data.frame(ProteinID, protRSA[-2,10:11], assign0p8) #fdataProt <- data.frame(ProteinID, assign0p8) rownames(fdataProt) <- 1:nrow(fdataProt) fdataProt
To create an MSnSet data structure we need to create a temporary folder and write out these two files in comma-separated format:
currentDir <- tempdir() setwd(currentDir) currentDir write.csv(exprsProt, file="exprs.csv", row.names=FALSE) write.csv(fdataProt, file="fdata.csv", row.names=FALSE)
Finally, we point to these files and use the "readMSnSet" function to create the data structure
f1p <- paste(currentDir, "\\exprs.csv", sep="") f2p <- paste(currentDir, "\\fdata.csv", sep="") protRSAstruct <- readMSnSet(exprsFile=f1p, featureDataFile=f2p, sep=",", header=TRUE, check.names=FALSE)
We may examine the expression data and meta data as follows:
head(exprs(protRSAstruct)) head(fData(protRSAstruct))
Having created the data structure "protRSAstruct" we may apply pRoloc functions to it. For example, to create a sub-cellular dendrogram, we may use the mrkHClust function, after first using "fvarLabels" to identify the column names:
fvarLabels(protRSAstruct) mrkHClust(protRSAstruct, fcol="\"assign0p8\"")
Similarly, we may create a heat map of average oranelle class profiles as follows:
fmat <- mrkConsProfiles(protRSAstruct, fcol="\"assign0p8\"") plotConsProfiles(fmat)
One may also extract data from a MSnbase proteomics data structure for use by protlocassign. We illustrate with the "protRSAstruct" that we just created:
exprsDataTemp <- exprs(protRSAstruct) fDataTemp <- fData(protRSAstruct) exprsData <- data.frame(exprsDataTemp, fDataTemp[,1:2]) head(exprsData)
This data frame is in the same format as the original file, "protRSA", which we defined previously.
The current vignette provides an introduction to the basic concepts of using the protlocassign package. The more extensive set of tutorials is available in the "protlocassigndoc" package; details on obtaining this were given in Section 1 of this vignette. The first three tutorials are similar to the contents of this vignette. Tutorial 4 discusses using CPA on simulations of proteins that are resident in two compartments, and Tutorial 5 discusses how these results change when we reformat the data into the form of a classical five-fraction differential fractionation experiment. Tutorial 6 presents details on preparing spectral-level data for further CPA analysis. Finally, Tutorial 7 presents methods of finding proteins that have similar profiles to a given protein.
Details of how data are pre-processed for input into the protlocassign
package may be found in the protlocassignDataDescription.md
file
in the \inst
subdirectory of the R package. We note that many of
these pre-processing steps could, as an alternative, be carried out
within R using the MSnSet
facilities also used by the pRoloc
package. An advantage of this approach is that these packages provide
a natural way to store metadata such as protein markers and their
localization as well including built-in validity checks and visualization
tools. See the references by L Gatto, LM Breckels, KS Lilley and colleagues
for details.
Chen, Yunmei and Ye, Xiaojing. Projection Onto A Simplex. arXiv 2011 https://arxiv.org/abs/1101.6081
Gatto L, Breckels LM, Wieczorek S, Burger T, Lilley KS. Mass-spectrometry-based spatial proteomics data analysis using pRoloc and pRolocdata. Bioinformatics. 2014 May 1;30(9):1322-4. doi:10.1093/bioinformatics/btu013. Epub 2014 Jan 11. PubMed PMID: 24413670; PubMed Central PMCID: PMC3998135.
Gatto L, Gibb S. Rainer J. MSnbase, efficient and elegant R-based processing and visualisation of raw mass spectrometry data. bioRxiv 2020.04.29.067868; doi: https://doi.org/10.1101/2020.04.29.067868
Gatto L, Lilley KS. MSnbase-an R/Bioconductor package for isobaric tagged mass spectrometry data visualization, processing and quantitation. Bioinformatics. 2012 Jan 15;28(2):288-9. doi: 10.1093/bioinformatics/btr645. PMID: 22113085.
Jadot, M.; Boonen, M.; Thirion, J.; Wang, N.; Xing, J.; Zhao, C.; Tannous, A.; Qian, M.; Zheng, H.; Everett, J. K., Accounting for protein subcellular localization: A compartmental map of the rat liver proteome. Molecular & Cellular Proteomics 2017, 16, (2), 194-212. https://doi.org/10.1074/mcp.M116.064527
Moore DF, Sleat D, Lobel P. A method to estimate the distribution of proteins across multiple compartments using data from quantitative proteomics subcellular fractionation experiments. Journal of Proteome Research, 2022, 21, 6, 1371-1381 https://doi.org/10.1021/acs.jproteome.1c00781
Tannous, A.; Boonen, M.; Zheng, H.; Zhao, C.; Germain, C. J.; Moore, D. F.; Sleat, D. E.; Jadot, M.; Lobel, P., Comparative analysis of quantitative mass spectrometric methods for subcellular proteomics. J Proteome Res 2020, 19, (4), 1718-1730. https://doi.org/10.1021/acs.jproteome.9b00862
Tran, A.N., Dussaq, A.M., Kennell, T. et al. HPAanalyze: an R package that facilitates the retrieval and analysis of the Human Protein Atlas data. BMC Bioinformatics 20, 463 (2019) https://doi:10.1186/s12859-019-3059-z
Following is the output of the utility sessionInfo
.
This output contains details of the packages and version
numbers used to generate these tutorials.
print(utils::sessionInfo(), width=80)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.