suppressPackageStartupMessages({
  library("BiocStyle")
  library("DEP")
  library("dplyr")
})

\pagebreak

Overview of the analysis

This package provides an integrated analysis workflow for robust and reproducible analysis of mass spectrometry proteomics data for differential protein expression or differential enrichment. It requires tabular input (e.g. txt files) as generated by quantitative analysis softwares of raw mass spectrometry data, such as MaxQuant or IsobarQuant. Functions are provided for data preparation, filtering, variance normalization and imputation of missing values, as well as statistical testing of differentially enriched / expressed proteins. It also includes tools to check intermediate steps in the workflow, such as normalization and missing values imputation. Finally, visualization tools are provided to explore the results, including heatmap, volcano plot and barplot representations. For scientists with limited experience in R, the package also contains wrapper functions that entail the complete analysis workflow and generate a report (see the chapter on Workflow functions). Even easier to use are the interactive Shiny apps that are provided by the package (see the chapter on Shiny apps).

Installation

The most recent version of R can be downloaded from CRAN. Additional system requirements are Pandoc, as well as Rtools for Windows and NetCDF for Linux.

Start R and install the DEP package:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("DEP")

library("DEP")

Interactive analysis using the DEP Shiny apps

The package contains r CRANpkg("shiny") apps, which allow for the interactive analysis entailing the entire workflow described below. These apps are especially relevant for users with limited or no experience in R. Currently, there are two different apps. One for label-free quantification (LFQ)-based analysis (output from MaxQuant) and the second for tandem-mass-tags (TMT)-based analysis (output from IsobarQuant). To run the apps, simply run this single command:

``` {r run_app, eval = FALSE}

For LFQ analysis

run_app("LFQ")

For TMT analysis

run_app("TMT")

# Differential analysis

## Example dataset: Ubiquitin interactors

We analyze a proteomics dataset in which Ubiquitin-protein interactors were characterized ([Zhang et al. Mol Cell 2017](http://www.cell.com/molecular-cell/fulltext/S1097-2765(17)30004-7)).
The raw mass spectrometry data were first analyzed using MaxQuant ([Cox and Mann, Nat Biotech 2007](http://www.nature.com/nbt/journal/v26/n12/full/nbt.1511.html)) and the resulting "proteinGroups.txt" file is used as input for the downstream analysis.

## Loading of the data

```r
# Loading a package required for data handling
library("dplyr")

# The data is provided with the package
data <- UbiLength

# We filter for contaminant proteins and decoy database hits, which are indicated by "+" in the columns "Potential.contaminants" and "Reverse", respectively. 
data <- filter(data, Reverse != "+", Potential.contaminant != "+")

The data.frame has the following dimensions:

``` {r dimension} dim(data)

The data.frame has the following column names:

``` {r colnames}
colnames(data)

The "LFQ.intensity" columns will be used for subsequent analysis.

Data preparation

The dataset has unique Uniprot identifiers, however those are not immediately informative. The associated gene names are informative, however these are not always unique.

``` {r unique}

Are there any duplicated gene names?

data$Gene.names %>% duplicated() %>% any()

Make a table of duplicated gene names

data %>% group_by(Gene.names) %>% summarize(frequency = n()) %>% arrange(desc(frequency)) %>% filter(frequency > 1)

For further analysis these proteins must get unique names.
Additionally, some proteins do not have an annotated gene name and for those we will use the Uniprot ID.

``` {r unique_names}
# Make unique names using the annotation in the "Gene.names" column as primary names and the annotation in "Protein.IDs" as name for those that do not have an gene name.
data_unique <- make_unique(data, "Gene.names", "Protein.IDs", delim = ";")

# Are there any duplicated names?
data$name %>% duplicated() %>% any()

Generate a SummarizedExperiment object

Many Bioconductor packages use r Biocpkg("SummarizedExperiment") objects as input and/or output. This class of objects contains and coordinates the actual (assay) data, information on the samples as well as feature annotation. We can generate the SummarizedExperiment object from our data using two different approaches. We can extract sample information directly from the column names or we add sample information using an experimental design template. An experimental design is included in the package for our example dataset.

The experimental design must contain 'label', 'condition' and 'replicate' columns. The 'label' column contains the identifiers of the different samples and they should correspond to the column names containing the assay data. The 'condition' and 'replicate' columns contain the annotation of these samples as defined by the user.

``` {r expdesign, echo = FALSE}

Display experimental design

knitr::kable(UbiLength_ExpDesign)

``` {r to_exprset}
# Generate a SummarizedExperiment object using an experimental design
LFQ_columns <- grep("LFQ.", colnames(data_unique)) # get LFQ column numbers
experimental_design <- UbiLength_ExpDesign
data_se <- make_se(data_unique, LFQ_columns, experimental_design)

# Generate a SummarizedExperiment object by parsing condition information from the column names
LFQ_columns <- grep("LFQ.", colnames(data_unique)) # get LFQ column numbers
data_se_parsed <- make_se_parse(data_unique, LFQ_columns)

# Let's have a look at the SummarizedExperiment object
data_se

Prerequisites of the SummarizedExperiment object

The make_se and make_se_parse functions generate a r Biocpkg("SummarizedExperiment") object that has a couple of specifications. The assay data is log2-transformed and its rownames depict the protein names. The rowData contains, amongst others, the 'name' and 'ID' columns that were generated by make_unique. The colData contains the experimental design and thereby the sample annotation. Thereby the colData includes the 'label', 'condition' and 'replicate' columns as well as a newly generated 'ID' column. The log2-transformed assay data and the specified rowData and colData columns are prerequisites for the subsequent analysis steps.

Filter on missing values

The dataset contains proteins which are not quantified in all replicates. Some proteins are even only quantified in a single replicate.

``` {r plot_data_noFilt, fig.width = 4, fig.height = 4}

Plot a barplot of the protein identification overlap between samples

plot_frequency(data_se)

This leaves our dataset with missing values, which need to be imputed. 
However, this should not be done for proteins that contain too many missing values.
Therefore, we first filter out proteins that contain too many missing values. 
This is done by setting the threshold for the allowed number of missing values per condition in the _filter_missval_ function.

``` {r filter_missval}
# Filter for proteins that are identified in all replicates of at least one condition
data_filt <- filter_missval(data_se, thr = 0)

# Less stringent filtering:
# Filter for proteins that are identified in 2 out of 3 replicates of at least one condition
data_filt2 <- filter_missval(data_se, thr = 1)

After filtering, the number of identified proteins per sample can be plotted as well as the overlap in identifications between samples.

``` {r plot_data, fig.width = 4, fig.height = 4}

Plot a barplot of the number of identified proteins per samples

plot_numbers(data_filt)

``` {r plot_data2, fig.width = 3, fig.height = 4}
# Plot a barplot of the protein identification overlap between samples
plot_coverage(data_filt)

Normalization

The data is background corrected and normalized by variance stabilizing transformation (r Biocpkg("vsn")).

``` {r normalize}

Normalize the data

data_norm <- normalize_vsn(data_filt)

The normalization can be inspected by checking the distributions of the samples before and after normalization.

``` {r plot_norm, fig.width = 4, fig.height = 5}
# Visualize normalization by boxplots for all samples before and after normalization
plot_normalization(data_filt, data_norm)

Impute data for missing values

The remaining missing values in the dataset need to be imputed. The data can be missing at random (MAR), for example if proteins are quantified in some replicates but not in others. Data can also be missing not at random (MNAR), for example if proteins are not quantified in specific conditions (e.g. in the control samples). MNAR can indicate that proteins are below the detection limit in specific samples, which could be very well the case in proteomics experiments. For these different conditions, different imputation methods have to be used, as described in the r Biocpkg("MSnbase") vignette and more specifically in the impute function descriptions.

To explore the pattern of missing values in the data, a heatmap is plotted indicating whether values are missing (0) or not (1). Only proteins with at least one missing value are visualized.

``` {r plot_missval, fig.height = 4, fig.width = 3}

Plot a heatmap of proteins with missing values

plot_missval(data_filt)

This heatmap indicates that missing values are highly biased to specific samples.
The example dataset is an affinity enrichment dataset of ubiquitin interactors, which is likely to have proteins which are below the detection limit in specific samples.
These can be proteins that are specifically enriched in the ubiquitin purifications, but are not enriched in the controls samples, or vice versa.
To check whether missing values are biased to lower intense proteins, the densities and cumulative fractions are plotted for proteins with and without missing values.   

``` {r plot_detect, fig.height = 4, fig.width = 4}
# Plot intensity distributions and cumulative fraction of proteins with and without missing values
plot_detect(data_filt)

Indeed the proteins with missing values have on average low intensities. This data (MNAR and close to the detection limit) should be imputed by a left-censored imputation method, such as the quantile regression-based left-censored function ("QRILC") or random draws from a left-shifted distribution ("MinProb" and "man"). In contrast, MAR data should be imputed with methods such as k-nearest neighbor ("knn") or maximum likelihood ("MLE") functions. See the r Biocpkg("MSnbase") vignette and more specifically the impute function description for more information.

``` {r impute, results = "hide", message = FALSE, warning = FALSE, error = TRUE}

All possible imputation methods are printed in an error, if an invalid function name is given.

impute(data_norm, fun = "")

Impute missing data using random draws from a Gaussian distribution centered around a minimal value (for MNAR)

data_imp <- impute(data_norm, fun = "MinProb", q = 0.01)

Impute missing data using random draws from a manually defined left-shifted Gaussian distribution (for MNAR)

data_imp_man <- impute(data_norm, fun = "man", shift = 1.8, scale = 0.3)

Impute missing data using the k-nearest neighbour approach (for MAR)

data_imp_knn <- impute(data_norm, fun = "knn", rowmax = 0.9)

The effect of the imputation on the distributions can be visualized.

``` {r plot_imp, fig.width = 4, fig.height = 4}
# Plot intensity distributions before and after imputation
plot_imputation(data_norm, data_imp)

Differential enrichment analysis

Protein-wise linear models combined with empirical Bayes statistics are used for the differential enrichment analysis (or differential expression analysis). The test_diff function introduced here uses r Biocpkg("limma") and automatically generates the contrasts to be tested. For the contrasts generation, the control sample has to be specified. Additionally, the types of contrasts to be produced need to be indicated, allowing the generation of all possible comparisons ("all") or the generation of contrasts of every sample versus control ("control"). Alternatively, the user can manually specify the contrasts to be tested (type = "manual"), which need to be specified in the argument test.

``` {r statistics}

Differential enrichment analysis based on linear models and empherical Bayes statistics

Test every sample versus control

data_diff <- test_diff(data_imp, type = "control", control = "Ctrl")

Test all possible comparisons of samples

data_diff_all_contrasts <- test_diff(data_imp, type = "all")

Test manually defined comparisons

data_diff_manual <- test_diff(data_imp, type = "manual", test = c("Ubi4_vs_Ctrl", "Ubi6_vs_Ctrl"))

Finally, significant proteins are defined by user-defined cutoffs using _add_rejections_.

``` {r add_reject}
# Denote significant proteins based on user defined cutoffs
dep <- add_rejections(data_diff, alpha = 0.05, lfc = log2(1.5))

Visualization of the results

The results from the previous analysis can be easily visualized by a number of functions. These visualizations assist in the determination of the optimal cutoffs to be used, highlight the most interesting samples and contrasts, and pinpoint differentially enriched/expressed proteins.

PCA plot

The PCA plot can be used to get a high-level overview of the data. This can be very useful to observe batch effects, such as clear differences between replicates.

``` {r pca, fig.height = 3, fig.width = 4}

Plot the first and second principal components

plot_pca(dep, x = 1, y = 2, n = 500, point_size = 4)

### Correlation matrix

A correlation matrix can be plotted as a heatmap, to visualize the Pearson correlations between the different samples.

``` {r corr, fig.height = 3, fig.width = 5}
# Plot the Pearson correlation matrix
plot_cor(dep, significant = TRUE, lower = 0, upper = 1, pal = "Reds")

Heatmap of all significant proteins

The heatmap representation gives an overview of all significant proteins (rows) in all samples (columns). This allows to see general trends, for example if one sample or replicate is really different compared to the others. Additionally, the clustering of samples (columns) can indicate closer related samples and clustering of proteins (rows) indicates similarly behaving proteins. The proteins can be clustered by k-means clustering (kmeans argument) and the number of clusters can be defined by argument k.

``` {r heatmap, fig.height = 5, fig.width = 3}

Plot a heatmap of all significant proteins with the data centered per protein

plot_heatmap(dep, type = "centered", kmeans = TRUE, k = 6, col_limit = 4, show_row_names = FALSE, indicate = c("condition", "replicate"))

The heatmap shows a clustering of replicates and indicates that 4Ubi and 6Ubi enrich a similar repertoire of proteins. The k-means clustering of proteins (general clusters of rows) nicely separates protein classes with different binding behaviors.  

***

Alternatively, a heatmap can be plotted using the contrasts, i.e. the direct sample comparisons, as columns. Here, this emphasises the enrichment of ubiquitin interactors compared to the control sample.

``` {r heatmap2, fig.height = 5, fig.width = 3}
# Plot a heatmap of all significant proteins (rows) and the tested contrasts (columns)
plot_heatmap(dep, type = "contrast", kmeans = TRUE, 
             k = 6, col_limit = 10, show_row_names = FALSE)

Volcano plots of specific contrasts

Volcano plots can be used to visualize a specific contrast (comparison between two samples). This allows to inspect the enrichment of proteins between the two samples (x axis) and their corresponding adjusted p value (y axis). The add_names argument can be set to FALSE if the protein labels should be omitted, for example if there are too many names.

``` {r volcano, fig.height = 5, fig.width = 5}

Plot a volcano plot for the contrast "Ubi6 vs Ctrl""

plot_volcano(dep, contrast = "Ubi6_vs_Ctrl", label_size = 2, add_names = TRUE)

### Barplots of a protein of interest

It can also be useful to plot the data of a single protein, for example if this protein is of special interest.

``` {r bar, fig.height = 3.5, fig.width = 3.5}
# Plot a barplot for USP15 and IKBKG
plot_single(dep, proteins = c("USP15", "IKBKG"))

# Plot a barplot for the protein USP15 with the data centered
plot_single(dep, proteins = "USP15", type = "centered")

Frequency plot of significant proteins and overlap of conditions

Proteins can be differentially enriched/expressed in multiple comparisons. To visualize the distribution of significant conditions per protein and the overlap between conditions, the plot_cond function can be used.

``` {r overlap, fig.height = 4, fig.width = 6}

Plot a frequency plot of significant proteins for the different conditions

plot_cond(dep)

## Results table

To extract a table containing the essential results, the _get_results_ function can be used.

``` {r results_table}
# Generate a results table
data_results <- get_results(dep)

# Number of significant proteins
data_results %>% filter(significant) %>% nrow()

The resulting table contains the following columns:

``` {r results_table2}

Column names of the results table

colnames(data_results)

Of these columns, the __p.val__ and __p.adj__ columns contain the raw and adjusted p values, respectively, for the contrast as depicted in the column name. The __ratio__ columns contain the average log2 fold changes. The __significant__ columns indicate whether the protein is differentially enriched/expressed, as defined by the chosen cutoffs. The __centered__ columns contain the average log2 fold changes scaled by protein-wise centering.

## Generate a data.frame from the resulting SummarizedExperiment object

You might want to obtain an ordinary data.frame of the results.
For this purpose, the package provides functions to convert SummarizedExperiment objects to data.frames.
_get_df_wide_ will generate a wide table, whereas _get_df_long_ will generate a long table.

``` {r get_df}
# Generate a wide data.frame
df_wide <- get_df_wide(dep)
# Generate a long data.frame
df_long <- get_df_long(dep)

Save your results object for reuse

To facilitate future analysis and/or visualization of your current data, saving your analyzed data is highly recommended. We save the final data object (dep) as well as intermediates of the analysis, i.e. the initial SummarizedExperiment object (data_se), normalized data (data_norm), imputed data (data_imp) and differentially expression analyzed data (data_diff). This allows us to easily change parameters in future analysis.

``` {r save_load, eval = FALSE}

Save analyzed data

save(data_se, data_norm, data_imp, data_diff, dep, file = "data.RData")

These data can be loaded in future R sessions using this command

load("data.RData")

# Workflow functions for the entire analysis

The package contains workflow functions that entail the complete analysis and generate a report. 

## LFQ-based DEP analysis

Differential enrichment analysis of label-free proteomics data can be performed using the _LFQ_ workflow function.

``` {r LFQ, message = FALSE}
# The data is provided with the package 
data <- UbiLength
experimental_design <- UbiLength_ExpDesign

# The wrapper function performs the full analysis
data_results <- LFQ(data, experimental_design, fun = "MinProb", 
      type = "control", control = "Ctrl", alpha = 0.05, lfc = 1)

This wrapper produces a list of objects, which can be used to create a report and/or for further analysis. The report function produces two reports (pdf and html), a results table and a RData object, which are saved in a generated "Report" folder.

``` {r report2, eval = FALSE}

Make a markdown report and save the results

report(data_results)

The results generated by the LFQ function contain, among others, the _results_ object (data.frame object) and the _dep_ object (SummarizedExperiment object).

``` {r report1}
# See all objects saved within the results object
names(data_results)

The results object contains the essential results and the dep object contains the full SummarizedExperiment object. The results table can be explored by selecting the $results object

``` {r LFQ_results3}

Extract the results table

results_table <- data_results$results

Number of significant proteins

results_table %>% filter(significant) %>% nrow()

The full data (_dep_ object) can be used for the plotting functions as described in the chapter ["Visualization of the results"](#visualization-of-the-results), for example a heatmap.

``` {r LFQ_results4, fig.height = 5, fig.width = 3}
# Extract the sign object
full_data <- data_results$dep

# Use the full data to generate a heatmap
plot_heatmap(full_data, type = "contrast", kmeans = TRUE, 
             k = 6, col_limit = 4, show_row_names = FALSE)

TMT-based DEP analysis

Differential enrichment analysis of Tandem-Mass-Tag labeled proteomics data is also supported. Protein results tables from IsobarQuant can be directly analyzed using the TMT wrapper function.

``` {r TMT, eval = FALSE}

Need example data

TMTdata <- example_data Exp_Design <- example_Exp_Design

The wrapper function performs the full analysis

TMTdata_results <- TMT(TMTdata, expdesign = Exp_Design, fun = "MinProb", type = "control", control = "Control", alpha = 0.05, lfc = 1)

# Session information

``` {r session_info, echo = FALSE}
sessionInfo()


arnesmits/DEP documentation built on Aug. 7, 2019, 10:44 a.m.