suppressPackageStartupMessages({
  library("BiocStyle")
  library("protrusionproteome")
  library(knitr)
  library(dplyr)
  library(lubridate)
  library(tidyverse)
})

\pagebreak

Abstract

protrusionproteome is a package that provides an analytical workflow of shotgun mass spectrometry-based proteomics experiments with tandem mass tag (TMT) labelingof protrusion profiling experiments [@Dermit2020]. Protrusion are purified from microporous transwell filters and their proteomes are compare to the cell bodies.

protrusionproteome requires tabular input e.g. proteinGroups.txt, peptides.txt and evidence.txt output of quantitative analysis software like MaxQuant. Functions are provided for preparation, filtering as well as log transformation,calculating TMT ratios and median substation as well as generating SummarizedExperiment objects. It also includes tools to visualize protrusion purification efficiency, protease miscleavages and TMT incorporation efficiency with visualization such as scatterplot and barplot representations. Finally, it includes statistical testing of significantly enrich categories in cell protrusions.

Installation and loading package

Start R and install the protrusionproteome package:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("protrusionproteome")
library("protrusionproteome")

Once you have the package installed, load protrusionproteome and dplyr for data transformation into R.

knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)
library(protrusionproteome)
library(dplyr)
library(SummarizedExperiment)
library(lubridate)
library(stringr)

Data loading, QC and processing

We analyze the time-course protrusion profiles proteome dataset from [@Dermit2020] (PRIDE PXD021239), which is provided within this package. The raw mass spectrometry data were first analyzed using MaxQuant [@Cox2014]. We initially perform a number of data quality checks including trypsin efficiency and TMT labeling incorporation.

Calculating protease efficiency

Tryspsin was the protease used to produce tryptic peptides for this dataset. Note that the maximum missed cleavages allowed on the MaxQuant search was 2. The numbers of trypsin miscleavages can be used as a proxy of trypsin efficiency. This information is contained in the peptides.txt file, and it is provided with this package:

# Peptides data is provided with the package
data("peptides.raw")
pepdata <- peptides.raw

To visualize trypsin efficiency, the plot_miscleavagerate function can be used.

# Stacked barplot of trypsin miscleavages
plot_miscleavagerate(pepdata)

Calculating TMT labeling efficiency

TMT label efficiency can be measured as a proxy of peptides modified by TMT. This information is contained in the evidence.txt file, and is provided with this package:

#Evidence data is provided with the package
data("evidence.raw")
evidencedata <- evidence.raw

To visualize TMT label efficiency, we can use plot_labelingefficiency function.

 plot_labelingefficiency(evidencedata)

Loading protein data.

Protein intensities are obtained from aggregated peptides over protein groups. This information is contained in the proteinGroups.txt file, is provided with this package and is used as input for the downstream analysis.

# The data is provided with the package
data("prot.raw")
data <- prot.raw

This dataset has the following dimensions:

dim(data)

Filtering the data

We filter for decoy database hits, contaminant proteins and hits only identified by site, which are indicated by "+" in the columns "Reverse","Potential.contaminants" and "Only.identified.by.site", respectively using filter_MaxQuant function.

proteins_filtered <- filter_MaxQuant(data,
    tofilter= c("Reverse" , "Potential.contaminant" ,"Only.identified.by.site"))

We can see that the filtered data has following dimensions:

dim(proteins_filtered)

Making unique identifiers

Uniprot names are protein unique identifiers but are not immediately informative. The associated Gene names are informative, however these are not always unique (i.e Gene.names is not primary key to the proteins_filtered table).

# Are Gene.names primary key to this table?
proteins_filtered %>% group_by(Gene.names) %>% summarize(count= n()) %>% 
  arrange(desc( count)) %>% filter( count > 1)

Even more critical, some proteins do not have an annotated Gene name. Similarly to [@Arne2018] approach, for those proteins missing a Gene name identifier we will use the Uniprot ID.

``` {r make-unique} data_unique <- make_unique(proteins_filtered, "Gene.names", "Protein.IDs", delim = ";")

We can check that name variable uniquely identifies proteins.
``` {r data-unique-cheking-duplicated-Gene-names}
data_unique %>% group_by(name) %>% tally() %>% filter(n > 1) %>% nrow()

SummarizedExperiment proteomics

Experimental design

The columns that will be used in the summarised experiment are Reporter.intensity.corrected columns. Below is the table of how the samples were labeled in this experiment.

columns_positions<-str_which(colnames(data_unique),"Reporter.intensity.corrected.(\\d)+.(\\d)")
intensities <- colnames(data_unique)[str_which(colnames(data_unique),"Reporter.intensity.corrected.(\\d)+.(\\d)")]
time_unit=30
time_span=c(1,2,4,8,16)
experiment <- str_c(minute(rep(minutes(x = time_unit) *time_span,each=2)),c("body","prot"),sep="_")
knitr::kable(tibble(intensities,experiment))

SummarizedExperiment of TMT data

SummarizedExperiment objects [@Morgan2020] are widely used across Bioconductor packages as data containers. This class of object contains the actual data (assays), information on the samples (colData) and additional feature annotation. We generate the SummarizedExperiment object from our data extracting information directly from the column names of rectangular data using the make_TMT_se function. The actual assay data is log2-transformed of median-subtracted Prot/Cell-bodies for each condition.

``` {r create-TMT-se, warning=FALSE, message=FALSE}

Generate a SummarizedExperiment object using file information and user input

columns_positions<-str_which(colnames(data_unique),"Reporter.intensity.corrected.(\d)+.(\d)") se <- make_TMT_se(data_unique,columns_positions,intensities,time_unit=30, time_span=c(1,2,4,8,16), numerator= "prot", denominator= "body", sep = "_")

Let's have a look at the SummarizedExperiment object
```r
se

As we see the number of columns has been reduced to half (from 10 column sample data to 5 column sample data containing log2 Prot/Cell-bodies ratios)

Exploring the description of each sample in the SummarizedExperiment object

SummarizedExperiment::colData(se)

Vizualing the results

We can visualise the Prot/Cell-bodies distribution across timepoints.

plot_scatter(se, 1, 2, "HIST", 'orange', 4, 4)

Histone proteins are depleted from cell body.

1D enrichment analysis

We can perform a protein enrichment analysis for a given timepoint

enrichment_table <- enrich_1D(se,timepoint= 5,
                               dbs = "GO_Molecular_Function_2018",
                              number_dbs=1)

Session information

{r session_info, echo = FALSE} sessionInfo()

References {-}



demar01/protrusionproteome documentation built on April 29, 2021, 5:47 a.m.