This vignette might be a bit overwhelming for a first-time user.
You might want to skim through this vignette and then take a look at the same two examples stripped to their essential analyses. These can be found at https://github.com/statOmics/MSqRobData/blob/master/inst/extdata/Francisella/analysis_Francisella.Rmd and https://github.com/statOmics/MSqRobData/blob/master/inst/extdata/CPTAC/analysis_CPTAC.Rmd for the Francisella and CPTAC example respectively.
Note that we mainly focus on MaxQuant output, but that any type type of search engine output can be used with MSqRob, see "1.2. Importing other data types" for more info.
If anything is unclear, please do not hesitate to contact me at ludger.goeminne@epfl.ch or to open an issue on GitHub. Your feedback is highly welcomed!
This vignette will guide you through the main features of the MSqRob package. We will highlight two examples in order to help you familiarize with its main features.
The MSqRob package is designed for the analysis of differential abundance in label-free mass spectrometry-based shotgun proteomics experiments. These experiments typically output large spectral files (e.g. ".mzML" files) that need to be searched for identifications by specialized peak detection and peptide identification software such as Andromeda (used in the MaxQuant software package) and Mascot Distiller. These algorithms output lists of putative peptides accompanied by their corresponding overall intensity estimates.
Often, researchers further summarize these peptide intensities to protein intensities. However, as we have shown before, this summarization-based approaches tend to perform suboptimal compared to peptide-based models, wherein the peptide intensities are directly provided as input to the statistical model used for differential quantification (Goeminne et al., 2015). Nonetheless, most of these models still suffer from overfitting issues.
This package implements the robust ridge approach that prevents overfitting in sparse data but also accurately fits highly abundant proteins as described in Goeminne et al. (2016).
MSqRob is completely free and open source, but please make a reference to Goeminne et al. (2016) when using our package in any kind of publication.
On Windows, make sure that RTools is installed. Go to: https://cran.r-project.org/bin/windows/Rtools/ to download RTools. A user guide on how to install RTools on Windows can be found at: https://github.com/stan-dev/rstan/wiki/Install-Rtools-for-Windows. Errors in MSqRob on Windows related to unable to zip the results Excel file might be related to errors in configuring RTools.
First, we need to install the package devtools:
#Change to eval=TRUE to execute this piece of code. install.packages("devtools", repos = "http://cran.us.r-project.org") library(devtools)
Then, we call this to install MSqRob from GitHub:
#Change to eval=TRUE to execute this piece of code. devtools::install_github("statOmics/MSqRob") library(MSqRob)
Note that the latest version of MSqRob can always be found at: https://github.com/statOmics/MSqRob.
The latest version of this vignette can be found at: https://github.com/statOmics/MSqRob/blob/master/vignettes/MSqRob.Rmd.
Loading the package is done via the library
command, as with any other package.
library(MSqRob)
Additionally, we load the packages Biobase
, MSnbase
and limma
, as we will use a few of their functions in this vignette.
library(Biobase) library(MSnbase) library(limma)
R packages uploaded on CRAN have a maximum size limit of 5 MB. As proteomics datasets are typically several hunderds of GB in size and even an output file of a peptide search can be tens of MBs in size, we uploaded our example data in a separate data package called MSqRobData. If you want to run the examples in this vignette, you also need the data package. This can also be downloaded from GitHub in the following way:
#Change to eval=TRUE to execute this piece of code. devtools::install_github("statOmics/MSqRobData") library(MSqRobData)
If you want to work with the Shiny graphical user interface (MaxQuant output only), use the following command:
#Change to eval=TRUE to execute this piece of code. shiny::runApp(system.file('App-MSqRob', package='MSqRob'))
The bacterium Francisella tularensis is a facultative intracellular parasite of macrophages and must import host cell arginine in order to be able to multiply inside the host cell. Ramond et al. (2015) investigated the effect of gene deletion of a newly identified arginine transporter called ArgP. Their data is deposited in the PRIDE repository at http://www.ebi.ac.uk/pride/archive/projects/PXD001584. We used the authors' supplied peptides.txt file which is the result of preprocessing with MaxQuant version 1.4.1.2. As the authors used now outdated GI numbers, we modified this file to incorporate NCBI accessions and protein names (September 25, 2016).
MSqRob data input can come from any kind of source, as long as feature-level or peptide-level data is available. I will here focus on a MaxQuant example, but in fact, output from any search engine can be used.
First, specify the location of the peptides.txt MaxQuant output file.
By default, MaxQuant creates a "combined" folder in the same folder as the raw files are (or if the raw files are in different directories, in the directory of the first raw file). The peptides.txt file can then be found in "path_to_raw_files/combined/txt/peptides.txt". Just provide the path to where the file is located on your computer. The system.file
function is used here only to access the example tab-delimited peptides.txt file that is delivered with our package.
file_peptides_txt <- system.file("extdata/Francisella", "peptides.txt", package = "MSqRobData") file_peptides_txt
Import the data using the import2MSnSet
function. The pattern
argument should contain a text string that is present in all column names that contain peptide intensities but not in the other column names. This defaults to "Intensity\ "
for MaxQuant data. The default setting remove_pattern=TRUE
then removes the pattern from these column names so that the names equal the names that were entered in MaxQuant's "Experiment" column when conducting the search. In this particular Francisella experiment, our column names start with a number (either "1" or "3"), which is not a valid column name in R. Therefore, an "X" is prepended to these names.
peptidesFranc <- import2MSnSet(file_peptides_txt, filetype = "MaxQuant", remove_pattern = TRUE)
import2MSnSet
creates an MSnSet
object with 3 slots:
exprs
: contains a matrix of peptide intensities of which the rows correspond to the peptides present in the dataset and the columns correspond to the different mass spec runs.
fData
: contains a data frame containing all additional information (such as peptide sequences, protein names, scores, number of missed cleavages, etc.) present in the input file.
pData
: contains a data frame with 0 columns and as many rows as there are mass spec runs. This slot will be used to store the experimental annotation.
If you want to assess whether importing happened in a correct way, you can always check the exprs
, fData
and pData
slots.
The exprs
slot should contain all the peptide intensities grouped per column.
head(exprs(peptidesFranc))
The fData
slot contains all other columns.
head(fData(peptidesFranc))
The pData
slot will contain the experimental annotation. Right now, it has 18 rows (equal to the number of mass spec runs) and 0 columns (as the experimental annotation is not yet loaded).
pData(peptidesFranc)
The import2MSnSet
has built-in support for moFF (Argentini et al., 2016), mzTab (Griss et al., 2014) and Progenesis (Nonlinear Dynamics, Newcastle upon Tyne, U.K.) output. You should then simply set filetype
to "moFF"
, "mzTab"
or "Progenesis"
respectively.
If you are using yet another data type, you can use MSqRob's read2MSnSet
function to import your data. Hereby, you have to specify (1) your file, (2) a regular expression pattern that indicates the columns with the peptide intensities or simply the indices of these columns and (3) the separator (e.g. "\t"
, ","
, ...).
Now, we are going to create a data frame that contains the experimental annotation. This data frame should contain all explanatory variables that groups different runs together. This is important in order to be able to include factors on which you will do inference (such as treatment effects or effects over time, or, in our case: the effect of the genetic knock-out) as well as other factors that might have an influence (such as biological repeats, technical repeats and potential confounders such as batch effects).
Note that the "experimental annotation" can also be given as an Excel file or a tab delimited file! This file should have the column names as a header (i.e. the first row in the file) and the same structure as the exp_annotation
data frame we create below. If you have your experimental annotation in a file, just provide the filepath to this annotation file in the preprocess_MaxQuant
function.
Also note that exactly one column in the experimental annotation should contain the names of the mass spec runs. These names should be equal to the names you entered in the "Experiment" column in MaxQuant (i.e. the column names of the exprs
slot of the peptidesFranc MSnSet
object). In our case, we will call this column "run" (see below). By default, MSqRob guesses this column by taking the only column of which its elements correspond to the column names of the exprs
slot. If there is no such a column, or there are multiple such columns, MSqRob throws an error.
In this example, we create the experimental annotation data frame ourselves.
Extract the names of the mass spec runs. The mass spec run names are equal to the column names of the exprs
slot of the peptidesFranc MSnSet
object. Other columns in the experimental annotation data frame will typically be derived from this column.
runs <- sampleNames(peptidesFranc) runs colnames(exprs(peptidesFranc))
Note again that the sample names have a prepended "X" because a syntactically valid name in R cannot start with a number.
Add a factor for the genetic knock-out, which we will call "genotype" (i.e. wild type (WT) or knock-out (KO)).
#Create the "genotype" vector genotype <- factor(c(rep("WT",9),rep("KO",9)), levels=c("WT","KO")) genotype
Add a factor for each of the 6 biological replicates (i.e. 3 biological repeats for each genotype).
#Create a factor for the biological replicates n <- nchar(runs) biorep <- as.factor(paste0("b_",rep(1:6,each=3))) biorep #Alternative: extract the biological repeat based on the mass spec run names. # biorep <- as.factor(paste0("b_",factor(as.numeric(factor(paste0(substr(runs, 1,1),substr(runs, n-2,n-2))))+c(rep(0,9),rep(3,9)))))
Each of the 18 technical repeats are here represented by the 18 mass spec runs.
Create an experimental annotation data frame.
exp_annotation <- data.frame(run=runs, genotype=genotype, biorep=biorep) exp_annotation
Note that you can also import the experimental annotation from a tab-delimited or Excel file. As an example, we show you here how you can import data from an experimental annotation file delivered with our package. First, provide the path to where the data is saved.
file_exp_annotation <- system.file("extdata/Francisella", "label-free_Francisella_annotation.xlsx", package = "MSqRobData") file_exp_annotation
Then, import the data using the read.xlsx
function from the openxlsx
package and convert the columns of the data frame to factors.
#Import Excel file exp_annotation_xlsx <- openxlsx::read.xlsx(file_exp_annotation) #Convert columns of data frame to factors exp_annotation_xlsx <- as.data.frame(unclass(exp_annotation_xlsx)) exp_annotation_xlsx
Note that here, there is no "X" prepended to the run names. However, names will be changed automatically to syntactically valid names by MSqRob during preprocessing.
The identification of peptides that carry one or more chemical modifications is often not as reliable as the identification of unmodified peptides. Therefore, a common step during the preprocessing of proteomics data consists of removing the proteins of which all identified peptides have one or more modification sites. If we want to remove these proteins that are only identified by modified peptides from the dataset, we also need the proteinGroups.txt file. This file can be found in the same location as the peptides.txt file ("path_to_raw_files/combined/txt/proteinGroups.txt"). Here, we make use of the proteinGroups.txt file that is delivered with the package.
Note: if you don't want to remove proteins that are only identified by modified peptides, set the remove_only_site
argument of the preprocess_MaxQuant
function to FALSE
and leave the file_proteinGroups
argument at its default value NULL
.
Define the path to the proteinGroups.txt file.
file_proteinGroups <- system.file("extdata/Francisella", "proteinGroups.txt", package = "MSqRobData") file_proteinGroups
Add to the useful_properties
argument all column names of the fData
slot of the peptidesFranc MSnSet
object that you would like to keep (such as gene names, protein names, ontologies, etc.). Other columns will be removed for improved efficiency and memory usage. Here, we would like to keep the GI number and the protein names. These 2 columns are present in our peptides.txt file, but note that you can use basic R manipulations to add other columns to the fData slot if you want to. We certainly want to keep the "Sequence" slot, as peptide-specific effects must be included in our model.
peptidesFranc2 <- preprocess_MaxQuant(peptidesFranc, accession="Proteins", exp_annotation=exp_annotation, logtransform=TRUE, base=2, normalisation="quantiles", smallestUniqueGroups=TRUE, useful_properties=c("GI number","Protein.names","Sequence"), filter=c("Contaminant","Reverse"), remove_only_site=TRUE, file_proteinGroups=file_proteinGroups, filter_symbol="+", minIdentified=2)
preprocess_MaxQuant
internally executes the following preprocessing steps in this order:
Peptide intensities are (log2-)transformed (logtransform=TRUE, base=2
).
As a normalization approach, we choose to quantile normalize the peptide intensities (normalisation="quantiles"
). Other options are "sum"
, "max"
, "center.mean"
, "center.median"
, "quantiles.robust"
, "vsn"
or "none"
(see MSnbase::normalize
for more information).
Handling overlapping protein groups (smallestUniqueGroups=TRUE
): in our approach a peptide can map to multiple accessions (accession="Proteins"
), as long as there is none of these proteins present in a smaller subgroup. Peptides that map to protein groups for wich also subgroups are present in the dataset are removed from the dataset. (Proteins in a protein group are separated by a ;
).
Remove contaminants and reverse sequences (filter=c("Contaminant","Reverse"), filter_symbol="+"
): each row with the filter symbol ("+" in our case) in the columns "Contaminant" and "Reverse" will be removed from the data. Note that the "Contaminant" column in older MaxQuant peptides.txt files can also be called "Potential.contaminant".
Remove all proteins that are only identified by peptides carrying a modification site (remove_only_site=TRUE
, file_proteinGroups=file_proteinGroups
).
Remove all superfluous columns in the fData
slot, except the "Proteins", "GI.number", "Protein.names" and "Sequence" columns (useful_properties=c("GI.number","Protein.names","Sequence")
). Note that the "Proteins" column does not have to be provided: the accession
column is always retained.
Remove all peptides that are only identified once in the dataset (minIdentified=2
).
Add experimental annotation (exp_annotation=exp_annotation
).
We can now make density plots to inspect the effect of the preprocessing on the distribution of our log2-transformed peptide intensities.
Here we take a closer look at how you can make the diagnostic plots from the Shiny app. We will plot both the density plots and the MDS plot.
#Make sure the order of the rows in your exp_annotation is the same as the columns of the exprs slot in your peptides object. all(exp_annotation[,"run"]==colnames(exprs(peptidesFranc))) #Color by biological repeat colors <- as.numeric(exp_annotation[,"biorep"]) ### Density plot before preprocessing ### densAll <- apply(log2(exprs(peptidesFranc)),2,density,na.rm=TRUE) #Automatically determine the plot window size ymax=max(vapply(densAll,function(d) max(d$y),1)) rangematrix <- vapply(densAll,function(d) range(d$x, na.rm=TRUE), c(1,1)) #no longer xlim=range(rangematrix,na.rm=TRUE) ylim=c(0,ymax) plot(densAll[[1]],col=colors[1],xlim=xlim,ylim=ylim, las=1, frame.plot=FALSE, main="Density before preprocessing") for (i in 2:length(densAll)) lines(densAll[[i]],col=colors[i]) ###### ### Density plot after preprocessing ### densAll <- apply(log2(exprs(peptidesFranc2)),2,density,na.rm=TRUE) #Automatically determine the plot window size ymax=max(vapply(densAll,function(d) max(d$y),1)) rangematrix <- vapply(densAll,function(d) range(d$x, na.rm=TRUE), c(1,1)) #no longer xlim=range(rangematrix,na.rm=TRUE) ylim=c(0,ymax) plot(densAll[[1]],col=colors[1],xlim=xlim,ylim=ylim, las=1, frame.plot=FALSE, main="Density after preprocessing") for (i in 2:length(densAll)) lines(densAll[[i]],col=colors[i]) ######
We can also construct an MDS plot. This plot will cluster similar runs together.
plotMDS(exprs(peptidesFranc2), col=colors, las=1)
The accession
argument denotes by which column the individual peptide observations should be grouped. The annotations
argument indicates which columns in peptidesFranc2 are annotations linked to the accessions (in our case: the GI number and the protein names). These annotations will be added to a separate annotation
slot.
proteinsFranc <- MSnSet2protdata(peptidesFranc2, accession="Proteins", annotations = c("GI.number", "Protein.names"))
Imagine you forgot a factor in your experimental annotation. Should you redo the whole preprocessing part? No, often, you can still add this factor thanks to the addVarFromVar
function. We will now show how we can add an extra (redundant) factor called techrep
.
Create a factor for each of the 18 technical replicates (each sample was analyzed in triplicate). This factor is completely redundant, as the "run" column in the experimental annotation already has 18 different levels and is thus already equal to the factor for the technical replicates.
techrep <- as.factor(paste0("t_",1:18)) names(techrep) <- levels(exp_annotation$run) techrep proteinsFranc2 <- addVarFromVar(proteinsFranc, basecol="run", name="techrep", vector=techrep) proteinsFranc2
Here, we illustrate how you can reproduce the detail plots from the Shiny app. We will take a closer look at the protein "WP_003039212".
protein_name <- "WP_003039212" data <- getData(proteinsFranc[protein_name]) #We plot the peptide sequences as different point symbols points <- as.numeric(droplevels(as.factor(data$Sequence))) #We don't want to use repeated plot symbols or invisible symbols pch_vals <- c(0:25,32:127) #Repeat pch_vals if there would be more than 122 unique levels pch_vals <- rep_len(pch_vals, length(unique(points))) pch_vals <- pch_vals[points] #We color by mass spec run colors <- grDevices::colorRampPalette(RColorBrewer::brewer.pal(8,"Spectral"))(length(unique(droplevels(as.factor(data$run))))) colors <- colors[as.numeric(droplevels(as.factor(data$run)))] #We make a different boxplot for each genotype boxplot(data$quant_value~as.factor(data$genotype), outline=FALSE, ylim=c(min(data$quant_value)-0.2,max(data$quant_value)+0.2), ylab="preprocessed peptide intensity", xlab="", main=paste0("Detail plot for protein ", protein_name), las=2, frame.plot=FALSE, frame=FALSE, col="grey", pars=list(boxcol="white")) #Add the data as separate points points(jitter((as.numeric(data$genotype)), factor=2),data$quant_value, col=colors, pch=pch_vals)
Fit the robust ridge models for each protein. The fit.model
function will fit a model to each protein in the protdata
object proteinsFranc
. The default setting reg="ridge"
indicates that ridge regression should be used. The default setting weights="Huber"
indicates that an M estimation algorithm should be used that assigns Huber weights to each peptide observation based on their residuals.
Caution: fitting the models takes some time (less than 15 minutes for the Francisella tularensis dataset on our system).
system.time(protLMFranc <- fit.model(proteinsFranc, response="quant_value", fixed=c("genotype"), random=c("biorep","run","Sequence"), add.intercept=TRUE, weights="Huber"))
The system.time
function was only added to output the duration of the model fitting. If you are not interested in this, you can of course omit it.
protLMFranc
is a protLM
object.
class(protLMFranc)
Printing out a protLM
object returns an overview of the first 6 accessions with their annotations and their corresponding models.
protLMFranc
The numbers in the Std.Dev.
column give the estimated standard deviations of the random effects. Effects starting with ridgeGroup
indicate shrunken fixed effects (we make use of the random effects framework of the lm4
package in order to estimate Ridge penalties). However, as the fixed effects in the ridgeGroups are orthogonalized using the Gramm-Schmidt procedure, these numbers are not very informative.
A protLM
object has three slots:
accession
slot: a vector containing all the accessionsmodel
slot: a list of all fitted models (one for each accession)annotation
slot: data frame of which each row (one for each accession) corresponds to the annotations of a different accession.These slots can be accessed using the getAccessions
, getModels
and getAnnotations
functions respectively. Here we give the accessions, models and annotations for the first 5 models.
getAccessions(protLMFranc)[1:5] getModels(protLMFranc)[1:5] getAnnotations(protLMFranc)[1:5,]
protLM
objects can be subsetted using either their numeric index or their corresponding accession.
Retain only the part corresponding to model 24 and model 50.
protLMFranc[c(24,50)]
Inspect the protLM
object corresponding to accession "WP_003033643"
.
protLMFranc["WP_003033643"]
Say we want to inspect cell division protein "WP_003040227"
. We can then select protein "WP_003040227"
from the protLM
object. We can extract the model via the getModels
function. simplify=TRUE
is the default setting and indicates that if you select one model, you only want this model and not a list with as only element this model. Note that if you select multiple models, you automatically receive a list of models.
modelWP_003040227 <- getModels(protLMFranc["WP_003040227"], simplify=TRUE) modelWP_003040227
The betaBVcovDf
function extracts a list from a model containing the following four elements:
NULL
.betaBVcovDf <- getBetaVcovDf(modelWP_003040227)
betaBVcovDf$beta betaBVcovDf$vcov betaBVcovDf$df betaBVcovDf$df_exp betaBVcovDf$sigma
The getBetaVcovDfList
function can be applied on (a subset of) the protLM
object protLMFranc
. This returns a list containing one or more lists similar to the betaBVcovDf
list.
betaVcovDfList <- getBetaVcovDfList(protLMFranc[1:2]) #Inspect the structure of the list without printing it out compeletely: str(betaVcovDfList)
You can save your proteinsFranc
and protLMFranc
objects to a .RDatas object that can be loaded in R or in the Shiny graphical user interface. If you want your .RDatas object to be loaded into the Shiny App, you should also specify the levelOptions
, random
and plot2DependentVars
objects. These objects are settings that are saved automatically when using the MSqRob Shiny App. In this example, I save the .RDatas object to my desktop.
#Change to eval=TRUE to execute this piece of code. result_files_new <- list() result_files_new$proteins <- proteinsFranc result_files_new$models <- protLMFranc ### This part is only needed if you want to load your .RDatas file in the MSqRob Shiny App ### #Should contain all contrast options for the fixed effects (see also 5. Statistical inference) result_files_new$levelOptions <- c("genotypeWT","genotypeKO") #Should contain all random effects result_files_new$random <- c("Sequence","biorep","run") #Should contain a list with all fixed and random effects result_files_new$plot2DependentVars <- as.list(c("genotype","Sequence","biorep","run")) ############################################################################################# saves_MSqRob(result_files, file="/Users/lgoeminn/Desktop/project_Francisella.RDatas") #Change file path to your desired save location!
If you want to load your saved file, first specify the folder where it is saved.
#Change to eval=TRUE to execute this piece of code. result_path <- "/Users/lgoeminn/Desktop/project_Francisella.RDatas"
To inspect the different objects in the file, you can use the inspect_loads_MSqRob
function.
#Change to eval=TRUE to execute this piece of code. inspect_loads_MSqRob(result_path)
To load this file and assign new protdata
and protLM
object to it, execute the following piece of code.
#Change to eval=TRUE to execute this piece of code. #To load the "project_Francisella.RDatas" file result_files <- loads_MSqRob(result_path) #assign a new protdata object to the saved proteins proteinsFranc_result <- result_files$proteins #assign a new protLM object to the saved models protLMFranc_result <- result_files$models
We are interested in which proteins differ significantly in abundance between ArgP mutant and wild type Francisella tularensis. Via the "MSqRob_levels"
attribute, you can check which factor levels are present in each model. Pick a model for which there are no missing levels for the factors for which you want to test contrasts and inspect the names. Note that these names are always the combination of the parameter name and the factor level.
attr(modelWP_003040227,"MSqRob_levels")
Here, we notice that the mutant is encoded as "genotypeKO"
and the wild type as "genotypeWT"
. We construct a contrast matrix L
for inferring on the fold change of interest. We are interested in the log2 fold change between mutant and wild type. As all intensities are log2-transformed, we should take the difference between "genotypeKO"
and "genotypeWT"
(because of the following property of logarithms: log2(a/b)=log2(a)-log2(b)).
L <- makeContrast(contrasts=c("genotypeKO-genotypeWT"), levels=c("genotypeWT","genotypeKO")) L
Imagine we would want to test the average log2 protein intensity between mutant and wild type, then L
would be constructed as follows:
L2 <- makeContrast(contrasts=c("genotypeKO/2+genotypeWT/2"), levels=c("genotypeWT","genotypeKO")) L2
Important notification: do not use the estimated parameters beta to determine the factor levels! You will notice for example that the level "genotypeWT"
is missing in the column names of the betas. This is because "genotypeWT"
is the reference class, giving the parameter corresponding to "genotypeKO"
the interpretation of the effect of "genotypeKO"
relative to "genotypeWT"
. It is important however to include "genotypeWT"
in the contrast matrix because for proteins where factor levels are missing, these parameters can get different interpretations. The makeContrast
function takes this into account and will return NA
when a contrast cannot be properly estimated.
betaBVcovDf$beta
Build a data frame containing estimate, standard error, degrees of freedom, T-value and p-value for each protein in the protLM
object protLMFranc, by using the test.protLMcontrast
function.
resultFranc <- test.protLMcontrast(protLMFranc, L)
The prot.p.adjust
function adds a new column "qval"
to the data frame based on an existing "pval"
column. The new column contains adjusted p-values using one of the methods in the p.adjust.methods
vector.
resultFranc <- prot.p.adjust(resultFranc, method="fdr")
The prot.signif
function needs a matrix (or a list of matrices) containing a column named "pval" and a column named "qval"
. The "pval"
column will be used to sort the matrix. The function generates a new column "signif"
containing 1 for all values of "qval"
with a value lower than the specified FDR level (default: 0.05) and 0 otherwise.
resultFranc <- prot.signif(resultFranc, level = 0.05)
Inspect the first 6 significant proteins:
head(resultFranc,6)
It is also possible to import your data as a simple data frame and make use of some custom preprocessing pipeline. Here, we will show the same preprocessing pipeline for the Francisella dataset as outlined above, but note that adding, removing or modifying steps in the preprocessing pipeline is now very simple as the data is in data frame format.
First, we import the Francisella dataset as a data frame via the base R function read.table
.
pepFrancDf <- read.table(file = file_peptides_txt, sep = "\t", header = TRUE, quote="", comment.char = "")
Of course, other filetypes can also be imported. Say for example your input file is a comma separated file stored in My Documents on a Windows computer, you could use for example:
#Change to eval=TRUE to execute this piece of code. pepFrancDf <- read.table("C:/Users/Ludger/Documents/peptides.csv", sep = ",", dec=".", header = TRUE, quote="", comment.char = "")
The data frame pepFrancD
is in wide format.
head(pepFrancDf)
If we want to mimick the MaxQuant preprocessing pipeline, we need to add a factor that indicates whether a protein is only identified by modified peptides. This requires some basic R programming. The information about which proteins are only identified by modified peptides can be found in the proteinGroups.txt file. We again make use of the proteinGroups.txt file that is delivered with the package.
#Import the proteinGroups.txt file delivered with the MSqRobData package file_proteinGroups <- system.file("extdata/Francisella", "proteinGroups.txt", package = "MSqRobData") proteinGroups <- read.table(file_proteinGroups, sep="\t", header=TRUE, quote="", comment.char = "") #Extract the column that indicates which proteins are only identified by a modification site only_site <- proteinGroups$Only.identified.by.site #If there are no such proteins (as is the case here), the column will be completely empty and R will import this by default as "NA". However, we want the an empty value instead of "NA" to keep this column consistent with the "Contaminant" and "Reverse" columns. only_site[is.na(only_site)] <- "" #Select the protein accessions that are only identified by one or more modified peptides filter_symbol <- "+" removed_proteins <- proteinGroups$Protein.IDs[only_site==filter_symbol] #Create a logical variable "rem" that holds "TRUE" if a row in the pepFrancDf data frame should be removed and FALSE otherwise. accession <- "Proteins" rem <- as.character(pepFrancDf[,accession]) %in% as.character(removed_proteins) #Create a new column "only_site" in "pepFrancDf" that indicates with a "+" which rows should be removed. pepFrancDf$only_site <- "" pepFrancDf$only_site[rem] <- "+"
Note that in this particular case, the previous chunck of code was superfluous, as there are no proteins in the dataset that are only identified by modified peptides.
We can easily preprocess the data frame using the preprocess_wide
function (for data in "long" format, use the preprocess_long
function). This function uses our standard preprocessing pipeline on data frames in "wide" format (i.e. the data frame has one observation row per subject with each measurement present as a different variable).
We already made the exp_annotation
object before (see "2.1. Create an experimental annotation data frame."), so we will not repeat this code again here. Importantly, however, contrary to the exprs
slot in the peptidesFranc2
object created before, the names of all intensity columns in data frame pepFrancDf
start with "Intensity.".
This is because the default setting remove_pattern=TRUE
in the import2MSnSet
function removes the prefix "Intensity\ "
from the columns containing the intensities as this allows users that are unfamiliar with R to just copy and paste the names they provided in MaxQuant's "Experiment" column as an input. Here, we just imported the peptides.txt file as it is. We thus create a new experimental annotation data frame called exp_annotation2
with run names starting with "Intensity." (a space is not syntactically valid in R, so it is automatically converted to a dot).
runs2 <- colnames(pepFrancDf)[grepl("Intensity.",colnames(pepFrancDf))] runs2 #Extract the relevant part of the mass spec run names that points to the genotype. genotype2 <- factor(substr(runs2, 11,13)) genotype2 n <- nchar(runs2) biorep2 <- as.factor(paste0("b_",factor(as.numeric(factor(paste0(substr(runs2, 10,10),substr(runs2, n-2,n-2))))+c(rep(0,9),rep(3,9))))) biorep exp_annotation2 <- data.frame(run=runs2, genotype=genotype2, biorep=biorep2) exp_annotation2
The preprocess_wide
function allows for the same preprocessing as the preprocess_MaxQuant
function on a regular data frame. If you inspect this function, you will notice that the same 8 preprocessing steps are executed. Remember that you can do any preprocessing you like using basic R data frame manipulations. The split
argument indicates which character string is used to separate the accession groups. The exp_annotation
argument, like before, contains the experimental annotation data frame or a character string indicating the filepath where the experimental annotation is saved as an Excel file or a tab-delimited file. The quant_cols
argument contains a vector of names or a vector of numeric indices pointing to the columns containing the quantitative values of interest (mostly intensities or areas under the curve). If quant_cols
contains only one character string, each column containing this pattern will be regarded as a column containing the quantitative values of interest. aggr_by
indicates the column by which the data should be aggregated. Here, the data is already aggregated to the peptide level (i.e. over different charge states and modification statuses). We need no further aggregation, thus we provide the "Sequence" column (the level to which aggregation has already been done). aggr_function
will be superfluous here, but if further aggregation would have been necessary, raw intensities would be summed up (aggr_function="sum"
). The other parameters are similar to the preprocess_MaxQuant
function.
pepFrancDf2 <- preprocess_wide(pepFrancDf, accession="Proteins", split=";", exp_annotation=exp_annotation2, quant_cols="Intensity.", aggr_by="Sequence", aggr_function="sum", logtransform=TRUE, base=2, normalisation="quantiles", smallestUniqueGroups=TRUE, useful_properties=c("GI.number","Protein.names","Sequence"), filter=c("Contaminant","Reverse","only_site"), filter_symbol="+", minIdentified=2)
Note that the experimental annotation is now present as an attribute of the pepFrancDf2
data frame.
attr(pepFrancDf2,"MSqRob_exp_annotation")
After custom preprocessing, you can transform the data to a protdata
object via the df2protdata
function.
The acc_col
argument in the df2protdata
function indicates the column index or column name of the accessions (i.e. the protein (group) identifiers). The quant_cols
argument is a vector of indices pointing to all columns containing peptide intensities. In the example below, the column named "Proteins" will be added to the accession
slot while all columns with indices in the quant_cols
vector will be added as a matrix to the intensities
slot. quant_name
indicates the name that will be given to the column containing the (preprocessed) intensities and run_name
indicates the name that will be given to the column containing the mass spec run names in the new protdata
object. Like before, we keep "GI.number" and "Protein.names" as annotations.
#Select columns with intensities. quant_cols <- which(grepl("Intensity.", colnames(pepFrancDf2))) quant_cols
#Call df2protdata function proteinsFrancDf <- df2protdata(pepFrancDf2, acc_col="Proteins", quant_cols=quant_cols, quant_name="quant_value", run_name="run", annotations=c("GI.number","Protein.names")) #Inspect proteinsFrancDf protdata object. proteinsFrancDf
Note that the proteinsFrancDf
object is the same as the proteinsFranc
object, only with a different order of columns in their data
slot.
library(MSqRob) library(Biobase)
This example involves the CPTAC Study 6. In this dataset Sigma's UPS1 standard containing 48 human proteins was spiked into a reference yeast proteome and analyzed on 7 instruments in 5 different laboratories (Paulovich et al., 2010). Raw data files can be downloaded at: https://cptac-data-portal.georgetown.edu/cptac/dataPublic/list?currentPath=%2FPhase_I_Data%2FStudy6&nonav=true
Here, we limited ourselves to the data of LTQ-Orbitrap at site 86, LTQ-Orbitrap O at site 65 and LTQ-Orbitrap W at site 56. The data was searched with MaxQuant version 1.5.2.8. Detailed search settings can be found in Goeminne et al. (2016).
First, specify the location of the peptides.txt MaxQuant output file. By default, MaxQuant creates a "combined" folder in the same folder as the raw files are (or if the raw files are in different directories, in the directory of the first raw file). The peptides.txt file can then be found in "path_to_raw_files/combined/txt/peptides.txt". Here, we make use of the peptides.txt file that is delivered with the package.
file_peptides_txt <- system.file("extdata/CPTAC", "peptides.txt", package = "MSqRobData")
Import MaxQuant's peptides.txt file and convert it to an MSnSet
object (see package MSnbase
).
peptidesCPTAC <- import2MSnSet(file_peptides_txt, filetype="MaxQuant", remove_pattern=TRUE)
The "pattern" argument indicates which columns in the peptides.txt file contain peptide intensities. Currently, all columns containing "Intensity\ "
in their column names will be treated as peptide intensities (default for MaxQuant). remove_pattern=TRUE
(default) indicates that the name of the pattern is removed from the column names.
Extract the names of the mass spec runs via the sampleNames
function from the Biobase
package.
runs <- sampleNames(peptidesCPTAC)
Note that as the names again start with a number, an "X" is prepended to make the names syntactically valid. Add a grouping factor for the spike-in condition (i.e. 6A, 6B, 6C, 6D and 6E).
condition <- factor(substr(runs, 2,3))
Add a grouping factor for the lab effect (i.e. LTQ-Orbitrap at site 86, LTQ-Orbitrap O at site 65 and LTQ-Orbitrap W at site 56).
lab <- factor(rep(rep(1:3,each=3),5)) #Alternative: #lab <- factor((as.numeric(substr(runs, 5,5))-1)%/%3+1)
Create an experimental annotation data frame.
exp_annotation <- data.frame(run=runs, condition=condition, lab=lab) exp_annotation
Remember that the exp_annotation
can also be given as an Excel file or a tab delimited file!
If we want to remove proteins that are only identified by modified peptides from the dataset, we also need the proteinGroups.txt file. This file can be found in the same location as the peptides.txt file ("path_to_raw_files/combined/txt/proteinGroups.txt"). Here, we make use of the proteinGroups.txt file that is delivered with the package.
Note: if you don't want to remove proteins that are only identified by modified peptides, set remove_only_site=FALSE
and leave the file_proteinGroups
argument at its default value (NULL
).
file_proteinGroups <- system.file("extdata/CPTAC", "proteinGroups.txt", package = "MSqRobData")
Important: in the CPTAC dataset, some human UPS proteins are NOT contaminants as these proteins were spiked in on purpose! We only want to remove those contaminants that are not human UPS proteins. We thus need to unmark these proteins as contaminant before preprocessing.
fData(peptidesCPTAC)[grepl("_HUMAN_UPS", fData(peptidesCPTAC)$Proteins),]$Potential.contaminant <- ""
Add to useful_properties
all column names of the fData
slot of the peptidesCPTAC MSnSet
object that you would like to keep (such as gene names, protein names, ontologies, etc.). Other columns will be removed for improved efficiency and memory usage. We certainly want to keep the "Proteins" slot, as this will be our protein (group) identifier. You also want to keep the "Sequence" slot, as peptide-specific effects must be included in our model.
peptidesCPTAC2 <- preprocess_MaxQuant(peptidesCPTAC, accession="Proteins", exp_annotation=exp_annotation, logtransform=TRUE, base=2, normalisation="quantiles", smallestUniqueGroups=TRUE, useful_properties=c("Proteins","Sequence"), filter=c("Potential.contaminant","Reverse"), remove_only_site=TRUE, file_proteinGroups=file_proteinGroups, filter_symbol="+", minIdentified=2)
The following preprocessing steps are executed in this order:
Peptide intensities are log2-transformed (logtransform=TRUE, base=2
).
As a normalization approach, we choose to quantile normalize the peptide intensities (normalisation="quantiles"
). Other options are "sum"
, "max"
, "center.mean"
, "center.median"
, "quantiles.robust"
, "vsn"
or "none"
(see MSnbase::normalize
for more information).
Handling overlapping protein groups (smallestUniqueGroups=TRUE
): in our approach a peptide can map to multiple proteins, as long as there is none of these proteins present in a smaller subgroup. Peptides that map to protein groups for wich also subgroups are present in the dataset are removed from the dataset.
Remove reverse sequences and contaminants (filter=c("Potential.contaminant","Reverse")
).
Remove all proteins that are only identified by peptides carrying a modification site (remove_only_site=TRUE
, file_proteinGroups=file_proteinGroups
).
Remove all superfluous columns in the fData slot, except the "Proteins" and "Sequence" columns (useful_properties=c("Proteins","Sequence")
).
Remove all peptides that are only identified once in the dataset (minIdentified=2
).
Add experimental annotation.
It is possible to do all the preprocessing steps yourself. That way, you can do more advanced preprocessing or change the order of certain preprocessing steps if you like to.
peptidesCPTAC3 <- log(peptidesCPTAC, base=2) #Alternative: #peptidesCPTAC3 <- log2(peptidesCPTAC3)
After log2-transformation, we want to replace -Inf
values in the peptide intensities by NA
values as these values originate from zeros in the untransformed intensities
matrix, which means these intensities were actually missing.
exprs <- exprs(peptidesCPTAC3) exprs[is.infinite(exprs)] <- NA exprs(peptidesCPTAC3) <- exprs
peptidesCPTAC3 <- normalize(peptidesCPTAC3, "quantiles")
groups2 <- smallestUniqueGroups(fData(peptidesCPTAC3)$Proteins) sel <- fData(peptidesCPTAC3)$Proteins %in% groups2 peptidesCPTAC3 <- peptidesCPTAC3[sel] head(fData(peptidesCPTAC3))
#Remove reverse sequences sel <- fData(peptidesCPTAC3)$Reverse != "+" peptidesCPTAC3 <- peptidesCPTAC3[sel] head(exprs(peptidesCPTAC3)) #Remove contaminants sel <- fData(peptidesCPTAC3)$Potential.contaminant != "+" peptidesCPTAC3 <- peptidesCPTAC3[sel] head(exprs(peptidesCPTAC3))
Small check:
length(unique(fData(peptidesCPTAC3)$Proteins[grepl('_HUMAN_UPS',fData(peptidesCPTAC3)$Proteins)]))
46 out of a total 48 spiked-in human UPS proteins are present in this dataset with at least one peptide sequence identified.
proteinGroups <- read.table(file_proteinGroups, sep="\t", header=TRUE, quote="", comment.char = "") only_site <- proteinGroups$Only.identified.by.site only_site[is.na(only_site)] <- "" removed_proteins <- proteinGroups$Protein.IDs[only_site=="+"] sel <- !(as.character(Biobase::fData(peptidesCPTAC3)[,"Proteins"]) %in% as.character(removed_proteins)) peptidesCPTAC3 <- peptidesCPTAC3[sel]
fData(peptidesCPTAC3) <- fData(peptidesCPTAC3)[,c("Proteins","Sequence")] #, "ups"
minIdentified <- 2 keepers <- rowSums(!is.na(exprs(peptidesCPTAC3)))>=minIdentified peptidesCPTAC3 <- peptidesCPTAC3[keepers] head(exprs(peptidesCPTAC3)) #Remove unused levels fData(peptidesCPTAC3) <- droplevels(fData(peptidesCPTAC3))
rownames(exp_annotation) <- exp_annotation[,"run"] peptidesCPTAC3 <- MSnbase::MSnSet(exprs=exprs(peptidesCPTAC3), fData=fData(peptidesCPTAC3), pData=exp_annotation)
Check whether our manually preprocessed peptidesCPTAC3
is equal to peptidesCPTAC2
.
all(exprs(peptidesCPTAC2)==exprs(peptidesCPTAC3), na.rm=TRUE) all(fData(peptidesCPTAC2)==fData(peptidesCPTAC3), na.rm=TRUE) all(pData(peptidesCPTAC2)==pData(peptidesCPTAC3), na.rm=TRUE)
Caution: this step might take some time! (about 5 minutes on our system). The accession element denotes by which column the individual peptide observations should be grouped.
system.time(proteinsCPTAC <- MSnSet2protdata(peptidesCPTAC2, accession="Proteins"))
When your goal is to analyze the CPTAC data in the recommended way (i.e. a robust ridge model with variance shrinkage and M estimation), your model should be constructed look as follows:
system.time(modelCPTAC_RR <- fit.model(protdata=proteinsCPTAC, response="quant_value", fixed=c("condition","lab"), random=c("Sequence","run"), shrinkage.fixed=NULL, weights="Huber", squeezeVar=TRUE))
Caution: model fitting takes some time with large datasets! On our computer, fitting the robust ridge model took about 13 - 18 minutes.
Saving results is completely analogous to the Francisella example.
In this section, we fit different types of models to the CPTAC data, analogous to what has been done in Goeminne et al. (2016) in order to compare their performances. We compare an ordinary least squares (OLS) model with and without M estimation (Huber weights) to a ridge model with and without M estimation. As we also want to assess the impact of the empirical Bayes variance estimation, we set squeezeVar=FALSE
so that no variance squeezing is performed yet.
In order to do an objective comparison between different approaches, we chose here not to include the "run" effect, because this leads to massive overfitting in the ordinary least squares approaches. This effect will mostly be equal to zero anyways because each repeat in the CPTAC dataset is in fact a technical repeat and therefore the variability in peptide intensities between samples will be very small. When doing a standard robust ridge analysis on true biological data, we advise however to include the "run" effect as a random effect. As the simultanous shrinkage of fixed effects was not yet implemented at the time, we provide "lab" as a random effect in order to get results that are as close as possible to those of Goeminne et al. (2016).
Fit an OLS model (shrinkage.fixed=c(0,0,0,0)
) without M estimation (weights=NULL
).
system.time(modelCPTAC_Lm <- fit.model(protdata=proteinsCPTAC, response="quant_value", fixed=c("condition","Sequence","lab"), random=NULL, shrinkage.fixed=c(0,0,0,0), weights=NULL, squeezeVar=FALSE))
Fit an OLS model (shrinkage.fixed=c(0,0,0,0)
) with M estimation (weights=Huber
).
system.time(modelCPTAC_Lm_M <- fit.model(protdata=proteinsCPTAC, response="quant_value", fixed=c("condition","Sequence","lab"), random=NULL, shrinkage.fixed=c(0,0,0,0), weights="Huber", squeezeVar=FALSE))
Fit a ridge model (shrinkage.fixed=c(0,1)
) without M estimation (weights=NULL
).
system.time(modelCPTAC_Ridge <- fit.model(protdata=proteinsCPTAC, response="quant_value", fixed="condition", random=c("Sequence","lab"), shrinkage.fixed=c(0,1), weights=NULL, squeezeVar=FALSE))
Fit a ridge model (shrinkage.fixed=c(0,1)
) with M estimation (weights=Huber
).
system.time(modelCPTAC_Ridge_M <- fit.model(protdata=proteinsCPTAC, response="quant_value", fixed="condition", random=c("Sequence","lab"), shrinkage.fixed=c(0,1), weights="Huber", squeezeVar=FALSE))
As you noticed, an OLS model can be fit by setting shrinkage.fixed=c(0,0,0,0)
. This indicates that neither the intercept and none of our three fixed effects should be shrunken. The ridge model is here specified by setting shrinkage.fixed=c(0,1)
. This indicates that the intercept should not be shrunken, but the fixed effect "condition"
should be shrunken. Alternatively, one could also set shrinkage.fixed=NULL
as the shrinkage of all fixed effects except for the intercept is the default behaviour of MSqRob.
attr(getModels(modelCPTAC_RR[1]),"MSqRob_levels")
MSqRob always tries to fit a model, but some models are overparameterized because too many parameters are fit on too few observations. These models have convergence problems and can be removed from the data prior to estimating p-values. This is only relevant for models that perform shrinkage or use any kind of random effect.
Extract the data as a list from the protdata object proteinsCPTAC
.
data <- getData(proteinsCPTAC)
convergence <- lapply(data, function(x){return(tryCatch(lme4::lFormula(formula("quant_value~1+(1|condition)+(1|lab)+(1|Sequence)"),x,control = lme4::lmerControl()), error=function(e){ return(NA) }))})
Get the indices of these overparameterized models.
na_indices <- which(is.na(convergence))
See which proteins you removed.
getAccessions(proteinsCPTAC)[na_indices]
Both UPS proteins in this list are only identified by one peptide.
getData(proteinsCPTAC["P69905ups|HBA_HUMAN_UPS"]) getData(proteinsCPTAC["P41159ups|LEP_HUMAN_UPS"])
Remove the overparameterized models from the protLM object modelCPTAC_RR
and from the data.
modelCPTAC_RR_cleaned <- modelCPTAC_RR[-na_indices] data <- data[-na_indices]
Construct the contrast matrix L for inferring on the fold changes of interest.
L <- makeContrast(contrasts=c("condition6B-condition6A","condition6C-condition6A","condition6D-condition6A","condition6E-condition6A","condition6C-condition6B","condition6D-condition6B","condition6E-condition6B","condition6D-condition6C","condition6E-condition6C","condition6E-condition6D"), levels=c("condition6A","condition6B","condition6C","condition6D","condition6E"))
Make the DA rankings for the ordinary least squares model (no ridge, no squeezed variances, no M estimation).
#modelCPTAC_Lm <- squeezePars(modelCPTAC_Lm, squeezeVar=FALSE) lm_normal <- test.protLMcontrast(modelCPTAC_Lm, L) lm_normal <- prot.p.adjust(lm_normal) lm_normal <- prot.signif(lm_normal)
Make the DA rankings for the ordinary least squares model with squeezed variances (no ridge, no M estimation).
modelCPTAC_Lm_Sq <- squeezePars(modelCPTAC_Lm, squeezeVar=TRUE) lm_Sq <- test.protLMcontrast(modelCPTAC_Lm_Sq, L) lm_Sq <- prot.p.adjust(lm_Sq) lm_Sq <- prot.signif(lm_Sq)
Make the DA rankings for the ordinary least squares model with squeezed variances and M estimation (no ridge).
modelCPTAC_Lm_M_Sq <- squeezePars(modelCPTAC_Lm_M, squeezeVar=TRUE) lm_SqM <- test.protLMcontrast(modelCPTAC_Lm_M_Sq, L) lm_SqM <- prot.p.adjust(lm_SqM) lm_SqM <- prot.signif(lm_SqM)
Make the DA rankings for the ridge regression model (no squeezed variances, no M estimation).
#modelCPTAC_Ridge <- squeezePars(modelCPTAC_Ridge, squeezeVar=FALSE) ridge <- test.protLMcontrast(modelCPTAC_Ridge, L) ridge <- prot.p.adjust(ridge) ridge <- prot.signif(ridge)
Make the DA rankings for the ridge regression model with squeezed variances (no M estimation).
modelCPTAC_Ridge_Sq <- squeezePars(modelCPTAC_Ridge, squeezeVar=TRUE) ridgeSq <- test.protLMcontrast(modelCPTAC_Ridge_Sq, L) ridgeSq <- prot.p.adjust(ridgeSq) ridgeSq <- prot.signif(ridgeSq)
Make the DA rankings for the ridge regression model with squeezed variances and M estimation. This is our robust ridge (RR) approach (the default MSqRob approach, although we included "lab" here as a random effect and we did not include a "run" effect).
modelCPTAC_Ridge_M_Sq <- squeezePars(modelCPTAC_Ridge_M, squeezeVar=TRUE) ridgeSqM <- test.protLMcontrast(modelCPTAC_Ridge_M, L) ridgeSqM <- prot.p.adjust(ridgeSqM) ridgeSqM <- prot.signif(ridgeSqM)
Our standard approach is to calculate the degrees of freedom via the trace of the Hat matrix. One could as well use custom degrees of freedom, for example in order to work more conservatively. Indeed, if we consider the peptides as pseudo-replicates, our experimental units are the different spike-in conditions in each lab. The number of degrees of freedom is then the number of combinations lab x condition minus the number of parameters estimated for lab (i.e. the number of labs minus 1) minus the number of parameters estimated for condition (i.e. the number of conditions minus 1) minus 1 for the intercept. We show an example for the modelCPTAC_RR
model.
custom_dfs <- sapply(data, function(x){return(length(unique(paste0(x$condition,x$lab)))-(length(unique(x$condition))-1)-(length(unique(x$lab))-1)-1)}) custom_dfs ridgeSqM_custom <- test.protLMcontrast(modelCPTAC_RR, L, custom_dfs = custom_dfs) ridgeSqM_custom <- prot.p.adjust(ridgeSqM_custom) ridgeSqM_custom <- prot.signif(ridgeSqM_custom)
In order to be able to make diagnostic plots and tables, load the required functions from our GitHub page.
source("https://raw.githubusercontent.com/statOmics/MSqRob/master/vignettes/plotfunctions.R")
Extra: add a column containing TRUE if the corresponding protein is a spiked-in UPS1 protein and FALSE if it is a yeast protein. For this, we use the addTPcol
function. pattern
indicates for the pattern in the protein name that points to a true positive and TPcolname
is the name to be given to the added column.
lm_normal <- addTPcol(lm_normal, pattern="UPS", TPcolname="ups") lm_Sq <- addTPcol(lm_Sq, pattern="UPS", TPcolname="ups") lm_SqM <- addTPcol(lm_SqM, pattern="UPS", TPcolname="ups") ridge <- addTPcol(ridge, pattern="UPS", TPcolname="ups") ridgeSq <- addTPcol(ridgeSq, pattern="UPS", TPcolname="ups") ridgeSqM <- addTPcol(ridgeSqM, pattern="UPS", TPcolname="ups")
The makeSummary
function allows to evaluate the performance of all methods on the CPTAC dataset. For this, we need to create a vector that contains the theoretical log2 fold changes of the spiked-in UPS1 proteins.
upratio <- c(1.565597,3.137504,4.744161,6.321928,1.571906, 3.178564,4.756331,1.60665,3.18442,1.577767)
Create the summary objects.
summary_lm <- makeSummary(lm_normal, upratio, TPcol="ups") summary_lmSq <- makeSummary(lm_Sq, upratio, TPcol="ups") summary_lmSqM <- makeSummary(lm_SqM, upratio, TPcol="ups") summary_ridge <- makeSummary(ridge, upratio, TPcol="ups") summary_ridgeSq <- makeSummary(ridgeSq, upratio, TPcol="ups") summary_ridgeSqM <- makeSummary(ridgeSqM, upratio, TPcol="ups")
Results
summary_lm summary_lmSq summary_lmSqM summary_ridge summary_ridgeSq summary_ridgeSqM
colors <- c(1,2,3,4,5,6) resultobjectlist <- list(lm_normal,lm_Sq,lm_SqM,ridge,ridgeSq,ridgeSqM) truth <- lm_normal[[1]][,"ups"] names(truth) <- rownames(lm_normal[[1]]) summary_objectlist <- lapply(resultobjectlist, function(x){return(makeSummary(x, upratio))}) plotROC(resultobjectlist, summary_objectlist=summary_objectlist, proteins=proteinsCPTAC, truth=truth, colors=colors, plotSVG=FALSE)
Argentini, A., Goeminne, L. J. E., Verheggen, K., Hulstaert, N., Staes, A., Clement, L. and Martens, L. (2016) moFF: a robust and automated approach to extract peptide ion intensities. Nature Methods 13, pp 964–966.
Bolstad, B. M., Irizarry, R. A., Astrand, M., and Speed, T. P. (2003) A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Bias and Variance. Bioinformatics 19(2), pp 185-193. http://bmbolstad.com/misc/normalize/normalize.html
Goeminne, L. J. E., Argentini, A., Martens, L., and Clement, L. (2015) Summarization vs Peptide-Based Models in Label-Free Quantitative Proteomics: Performance, Pitfalls, and Data Analysis Guidelines. Journal of Proteome Research 14(6), pp 2457-2465.
Goeminne, L. J. E., Gevaert, K., and Clement, L. (2016) Peptide-level Robust Ridge Regression Improves Estimation, Sensitivity, and Specificity in Data-dependent Quantitative Label-free Shotgun Proteomics. Molecular & Cellular Proteomics 15(2), pp 657-668.
Griss, J., Jones, A.R., Sachsenberg, T., Walzer, M., Gatto, L., Hartler, J., Thallinger, G.G., Salek, R.M., Steinbeck, C., Neuhauser, N., Cox, J., Neumann, S., Fan, J., Reisinger, F., Xu, Q.W., Del Toro, N., Pérez-Riverol, Y., Ghali, F., Bandeira, N., Xenarios, I., Kohlbacher, O., Vizcaíno, J.A. & Hermjakob, H. (2014) The mzTab data exchange format: communicating mass-spectrometry-based proteomics and metabolomics experimental results to a wider audience. Molecular and Cellular Proteomics 13(10), pp 2765-75.
Navarro P., Kuharev J., Gillet, L. C., Bernhardt, O. M., MacLean, B., Röst, H. L., Tate, S. A., Tsou, C., Reiter, L., Distler, U., Rosenberger, G., Perez-Riverol, Y., Nesvizhskii, A. I., Aebersold, R., and Tenzer, S. (2016) A multicenter study benchmarks software tools for label-free proteome quantification. Nature Biotechnology.
Paulovich AG, Billheimer D, Ham A-JL, et al. (2010) Interlaboratory Study Characterizing a Yeast Performance Standard for Benchmarking LC-MS Platform Performance. Molecular & Cellular Proteomic 9(2), pp 242-254.
Ramond, E., Gesbert, G., Guerrera, I. C., Chhuon, C., Dupuis, M., Rigard, M., Henry, T., Barel, M., and Charbit, A. (2015) Importance of host cell arginine uptake in Francisella phagosomal escape and ribosomal protein amounts. Molecular & Cellular Proteomics 14, pp 870–881.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.