knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL ## cf https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)
In this tutorial, we'll walk you through the process of modelling single-cell proteomics (SCP) data using the scplainer approach (@Vanderaa2023-scplainer). By the end of this vignette, you will be able to:
The last point will allow you to generate SCP data that is suitable for downstream analysis, such as clustering or trajectory inference. The figure below provides a roadmap of the workflow:
The vignette will start with the processed data extracted as a SingleCellExperiment object from a processed QFeatures object. We will not cover data processing as it is covered in another vignette.
library("scp")
library("SingleCellExperiment")
library("patchwork")
library("ggplot2")
The example data set is a subset of the leduc2022_pSCoPE data set (see ?scpdata::leduc2022_pSCoPE for more info). The data was acquired using TMT-18 multiplexing and data-dependent acquisition (DDA). The data has been processed using a minimal workflow:
We suggest using this minimal processing workflow, although the approach presented here is agnostic of previous processing and allows for other custom workflows. The data processing was conducted with QFeatures and scp.
data("leduc_minimal")
leduc_minimal
The data set is formatted as a SingleCellExperiment object. The data set consists of 200 peptides and 73 cells. Peptide annotations can be retrieved from the rowData and cell annotations can be retrieved from the colData. The cell annotations will be used during modelling.
A full reanalysis of Leduc's nPOP dataset is also available here.
The core of the approach relies on statistical modelling of the data using linear regression. Under the hood, the model fetches as input the intensity matrix stored in assay(leduc_minimal). The cell annotations are retrieved using colData(leduc_minimal). They describe known technical and biological variables that may influence the acquired peptide intensities. The annotations are used to build a regression model with $p$ parameters. Then, the model estimates the coefficients. Coefficients provide the contribution of each parameter to the expression of each peptide, as well as the uncertainty of the estimation. These will be explored in the following section.
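To make the modelling idea concrete, here is a minimal base R sketch that fits one linear regression per peptide on its observed values only. It runs on simulated data and is not the scp implementation; the toy matrix `x`, the `cells` annotation table and the reduced formula are assumptions made for illustration.

```r
## Toy data: 3 peptides x 8 cells, with missing values (NA).
set.seed(1)
cells <- data.frame(
    Set = factor(rep(c("run1", "run2"), each = 4)),
    SampleType = factor(rep(c("Melanoma", "Monocyte"), times = 4))
)
x <- matrix(rnorm(3 * 8), nrow = 3,
            dimnames = list(paste0("pep", 1:3), paste0("cell", 1:8)))
x["pep2", c(2, 5)] <- NA

## Fit one regression per peptide; lm() drops the missing values.
fits <- lapply(rownames(x), function(pep) {
    lm(intensity ~ 1 + Set + SampleType,
       data = cbind(cells, intensity = unname(x[pep, ])))
})
names(fits) <- rownames(x)

## Estimated coefficients and their standard errors for one peptide.
summary(fits$pep2)$coefficients[, c("Estimate", "Std. Error")]
```

Each fit yields p = 3 coefficients (intercept, run, cell type) together with their standard errors, which is the kind of output the downstream analyses build on.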
We'll start by defining the variables to include in the model. Recall that the example data set contains TMT-labelled cells. This means that each MS acquisition run contains multiple cells. Each run is subject to technical fluctuations that can lead to undesired variation, known as a batch effect.
table(leduc_minimal$Set)
The labelling reagent (Channel) can also lead to undesired systematic effects and will also be considered as a source of batch effects.
table(leduc_minimal$Channel)
Each cell is also processed individually, and the amount of peptide material recovered from each cell may lead to undesired variation as well. This issue is usually solved through normalization, such as removing the median intensity from each cell. Normalization was intentionally omitted in the minimal data processing so that we can account for it during modelling. The median intensities were already computed (MedianIntensity).
hist(leduc_minimal$MedianIntensity, breaks = 10)
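As a point of comparison, classical median normalization can be sketched in base R as follows. The toy matrix is made up, and the subtraction assumes intensities on a log scale; the model-based alternative instead keeps the data untouched and uses the per-cell median as a covariate.

```r
## Toy log-intensity matrix: 4 peptides x 3 cells, one missing value.
x <- matrix(c(10, 11, 12, NA,
              12, 13, 14, 15,
               9, 10, 11, 12),
            nrow = 4,
            dimnames = list(paste0("pep", 1:4), paste0("cell", 1:3)))

## Median intensity per cell, ignoring missing values.
medianIntensity <- apply(x, 2, median, na.rm = TRUE)

## Classical normalization: subtract each cell's median from its column.
xNorm <- sweep(x, 2, medianIntensity, FUN = "-")
```

After the subtraction, every cell has a median of zero, which removes differences in recovered material but also fixes the correction before any modelling takes place.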
Finally, the biological variable of interest in the example data set is the cell type (SampleType), which is known because the cells come from 2 cell lines.
table(leduc_minimal$SampleType)
We create a formula object that defines which variables must be modelled in our analysis.
f <- ~ 1 + ## intercept
    Channel + Set + ## batch variables
    MedianIntensity + ## normalization
    SampleType ## biological variable
Note that the formula can be adapted to the data set. For instance, no labelling reagent is used in LFQ experiments, so Channel can be dropped. Similarly, each cell in an LFQ experiment is acquired in its own run, so the MS run cannot be used as a batch effect variable. The day of acquisition could be used instead.
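For illustration, a formula for such an LFQ design could look like the sketch below. The column name `AcquisitionDay` is hypothetical and assumes that the colData contains a variable recording the day of acquisition.

```r
## Hypothetical LFQ design: no Channel, and Set replaced by an
## assumed 'AcquisitionDay' annotation in the colData.
fLfq <- ~ 1 +          ## intercept
    AcquisitionDay +   ## batch variable (assumed column name)
    MedianIntensity +  ## normalization
    SampleType         ## biological variable

## Variables that must be present in the cell annotations.
all.vars(fLfq)
```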
Once a model is defined, we fit it with scpModelWorkflow().
leduc_minimal <- scpModelWorkflow(leduc_minimal, formula = f)
You can always retrieve the formula that was used to fit the model using:
scpModelFormula(leduc_minimal)
The data that is modelled by each variable is contained in the so-called effect matrices.
scpModelEffects(leduc_minimal)
Similarly, the data that could not be captured by the model is contained in the residual matrix.
scpModelResiduals(leduc_minimal)[1:5, 1:5]
Finally, the input data used to fit the model can also be retrieved.
scpModelInput(leduc_minimal)[1:5, 1:5]
Note that the number of peptides changed. This is the consequence of peptide filtering.
dim(scpModelInput(leduc_minimal))
The proportion of missing values for each feature is high in single-cell proteomics data.
table(missing = is.na(assay(leduc_minimal)))
Many features typically have more coefficients to estimate than observed values. The model cannot be estimated for these features, and they are ignored during further steps. These features are identified by computing the ratio between the number of observed values and the number of coefficients to estimate. We call it the n/p ratio. You can extract the n/p ratio for each feature:
head(scpModelFilterNPRatio(leduc_minimal))
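The n/p ratio can be illustrated by hand on a toy example. The sketch below is a simplification: it uses the same number of coefficients p for every peptide, whereas the actual computation in scp may adapt p per feature (for instance when some runs are never observed for a peptide). All object names are made up for the example.

```r
set.seed(42)
## Toy data: 4 peptides x 12 cells; NA marks a missing value.
x <- matrix(rnorm(4 * 12), nrow = 4,
            dimnames = list(paste0("pep", 1:4), NULL))
x["pep2", 1:4] <- NA
x["pep3", 1:10] <- NA

cells <- data.frame(
    Set = factor(rep(c("run1", "run2", "run3"), each = 4)),
    SampleType = factor(rep(c("Melanoma", "Monocyte"), times = 6))
)

## p: number of coefficients in the full design (1 + 2 + 1 = 4).
p <- ncol(model.matrix(~ 1 + Set + SampleType, data = cells))
## n: number of observed values per peptide.
n <- rowSums(!is.na(x))
npRatio <- n / p
npRatio

## Peptides that would pass a threshold of 1.5.
names(npRatio)[npRatio > 1.5]
```

Here pep3 has only 2 observed values for 4 coefficients (n/p = 0.5), so no threshold of 1 or above would retain it.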
Once the model is estimated, use scpModelFilterPlot() to explore the distribution of n/p ratios across the features.
scpModelFilterPlot(leduc_minimal)
By default, any feature that has an n/p ratio greater than 1 is included in the analysis. However, features with an n/p ratio close to 1 may lead to unreliable outcomes because there is not enough observed data. You can think of the n/p ratio as the average number of replicates per coefficient to estimate. Therefore, you may want to increase the n/p threshold.
scpModelFilterThreshold(leduc_minimal) ## default is 1
scpModelFilterThreshold(leduc_minimal) <- 1.5
scpModelFilterThreshold(leduc_minimal) ## threshold is now 1.5
The plot is automatically updated.
scpModelFilterPlot(leduc_minimal)
There are no guidelines for defining a suitable threshold. If it is too low, you may include noisy peptides that have too few observations. If it is too high, you may remove many informative peptides. The choice of threshold is a trade-off between precision and sensitivity.
The variance analysis reports the relative amount of information that is captured by each cell annotation included in the model. The analysis also reports the residual information that is not captured by the model. This offers a first glimpse into what information is contained in the data.
(vaRes <- scpVarianceAnalysis(leduc_minimal))
The results are a list of tables, one table for each variable. Each table reports for each peptide the variance captured (SS), the residual degrees of freedom for estimating the variance (df) and the percentage of total variance explained (percentExplainedVar).
vaRes$SampleType
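The notion of "percentage of variance explained" can be illustrated on a single simulated peptide with a standard ANOVA decomposition in base R. This is a sketch of the general idea only: it relies on the sequential sums of squares returned by `anova()`, which is not necessarily how scp computes its SS and percentExplainedVar columns.

```r
set.seed(123)
cells <- data.frame(
    Set = factor(rep(c("run1", "run2"), each = 10)),
    SampleType = factor(rep(c("Melanoma", "Monocyte"), times = 10))
)
## One simulated peptide with a batch effect and a cell type effect.
cells$y <- 2 * (cells$Set == "run2") +
    1 * (cells$SampleType == "Monocyte") +
    rnorm(20, sd = 0.5)

fit <- lm(y ~ Set + SampleType, data = cells)
tab <- anova(fit)

## Sum of squares per model term, plus the residuals.
ss <- setNames(tab[["Sum Sq"]], rownames(tab))
percentVar <- 100 * ss / sum(ss)
round(percentVar, 1)
```

Because the toy design is balanced, the order of the terms in the formula does not change the decomposition; with unbalanced designs, sequential sums of squares depend on that order.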
By default, we explore the variance for all peptides combined:
scpVariancePlot(vaRes)
We explore the top 10 peptides that have the highest percentage of variance explained by the biological variable (top) or by the batch variable (bottom).
scpVariancePlot(
    vaRes, top = 10, by = "percentExplainedVar", effect = "SampleType",
    decreasing = TRUE, combined = FALSE
) +
    scpVariancePlot(
        vaRes, top = 10, by = "percentExplainedVar", effect = "Set",
        decreasing = TRUE, combined = FALSE
    ) +
    plot_layout(ncol = 1, guides = "collect")
We can also group the peptides by protein. To do so, we first need to add the peptide annotations available from the rowData.
vaRes <- scpAnnotateResults(
    vaRes, rowData(leduc_minimal), by = "feature", by2 = "Sequence"
)
vaRes$SampleType
Then, we draw the same plot, but this time we provide the fcol argument.
scpVariancePlot(
    vaRes, top = 10, by = "percentExplainedVar", effect = "SampleType",
    decreasing = TRUE, combined = FALSE, fcol = "gene"
) +
    scpVariancePlot(
        vaRes, top = 10, by = "percentExplainedVar", effect = "Set",
        decreasing = TRUE, combined = FALSE, fcol = "gene"
    ) +
    plot_layout(ncol = 1, guides = "collect")
In this example data set, the retrieved peptides all belong to different proteins. Grouping becomes more interesting when analysing real data sets.
Alternatively, we can generate protein level results by aggregating peptide level results.
vaProtein <- scpVarianceAggregate(vaRes, fcol = "gene")
scpVariancePlot(
    vaProtein, effect = "SampleType", top = 10, combined = FALSE
)
Differential abundance analysis dives deeper into the exploration of the data, namely into the biological effects. Given two groups of interest, such as two cell types or two treatment groups, the differential analysis derives estimated fold changes from the linear model's coefficients. This provides, for each peptide or protein, the amount of change between the two groups and the direction of the change. Moreover, the model provides the uncertainty of the differences, enabling the assessment of statistical significance.
The difference of interest is specified using the contrasts argument. The first element points to the variable to test and the two following elements are the groups of interest to compare. You can provide multiple contrasts in a list.
(daRes <- scpDifferentialAnalysis(
    leduc_minimal,
    contrasts = list(c("SampleType", "Melanoma", "Monocyte"))
))
Similarly to variance analysis, the results are a list of tables, one table for each contrast. Each table reports for each peptide the estimated difference between the two groups, the standard error associated to the estimation, the degrees of freedom, the t-statistics, the associated p-value and the p-value FDR-adjusted for multiple testing across all peptides.
daRes$SampleType_Melanoma_vs_Monocyte
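The quantities listed above (estimate, standard error, degrees of freedom, t-statistic, p-value and adjusted p-value) can be reproduced on simulated data with base R. This is a sketch of the general technique, not of the scp internals; the column names are chosen for the example, the adjustment uses the Benjamini-Hochberg procedure as an assumption, and the difference is computed as Monocyte minus Melanoma, which may not match the sign convention used by scpDifferentialAnalysis().

```r
set.seed(7)
group <- factor(rep(c("Melanoma", "Monocyte"), each = 10))
## 5 simulated peptides x 20 cells; only pep1 truly differs.
x <- matrix(rnorm(5 * 20), nrow = 5,
            dimnames = list(paste0("pep", 1:5), NULL))
x["pep1", group == "Monocyte"] <- x["pep1", group == "Monocyte"] + 3

res <- t(apply(x, 1, function(y) {
    fit <- lm(y ~ group)
    ## Row "groupMonocyte" holds the Monocyte - Melanoma difference.
    summary(fit)$coefficients["groupMonocyte", ]
}))
res <- as.data.frame(res)
names(res) <- c("Estimate", "SE", "tstatistic", "pvalue")
res$Df <- length(group) - 2
## Adjust for multiple testing across all peptides.
res$padj <- p.adjust(res$pvalue, method = "BH")
res
```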
We then visualize the results using a volcano plot. The function below returns a volcano plot for each contrast.
scpVolcanoPlot(daRes)
Since we subset the data set to only a few cells, we lack statistical power. Still, two peptides come out as significant. Again, to better explore the results, we add peptide annotations available from the rowData, but we also add the n/p ratio as annotation.
daRes <- scpAnnotateResults(
    daRes, rowData(leduc_minimal),
    by = "feature", by2 = "Sequence"
)
np <- scpModelFilterNPRatio(leduc_minimal)
daRes <- scpAnnotateResults(
    daRes, data.frame(feature = names(np), npRatio = np),
    by = "feature"
)
We plot the same volcano plot, but instead of labelling points with the peptide sequence, we show the associated gene symbol. We can also control point aesthetics by providing a list of ggplot2::geom_point() arguments. For example, we can colour each point based on the n/p ratio, and adjust point size and shape.
scpVolcanoPlot(
    daRes, top = 30, textBy = "gene",
    pointParams = list(aes(colour = npRatio), size = 1.5, shape = 3)
)
We can also provide protein-level results. To do so, scpDifferentialAggregate() relies on the metapod package. Here, we combine the statistical test results for peptides that belong to the same protein using Simes' method. Simes' method will reject the combined null hypothesis (that is, the mean protein intensities are identical between the two groups) if any of the peptide nulls are rejected.
byProteinDA <- scpDifferentialAggregate(
    daRes, fcol = "gene", method = "simes"
)
byProteinDA$SampleType_Melanoma_vs_Monocyte
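For reference, Simes' combined p-value is simple enough to write down in a few lines of base R. The helper `simes()` below is a hand-written illustration, not the metapod implementation, and the p-values and gene labels are made up.

```r
## Simes' combined p-value: min over i of n * p_(i) / i,
## where p_(1) <= ... <= p_(n) are the sorted p-values.
simes <- function(p) {
    p <- sort(p[!is.na(p)])
    n <- length(p)
    min(n * p / seq_len(n))
}

## Hypothetical peptide-level p-values grouped by gene.
pvals <- c(0.01, 0.04, 0.50, 0.20, 0.30)
gene  <- c("A",  "A",  "A",  "B",  "B")
combined <- tapply(pvals, gene, simes)
combined
## A: 0.03, B: 0.30
```

For gene A, the smallest peptide p-value (0.01) drives the combined result after scaling by the number of peptides (3 * 0.01 / 1 = 0.03), which shows how a single strongly significant peptide can be enough to reject the protein-level null.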
Variance and differential analysis are not specific to single-cell applications and explore the data without considering cellular heterogeneity. The purpose of the component analysis is to dive into the cellular heterogeneity by representing highly dimensional data in a few informative dimensions for visual exploration. We integrate the component analysis with the linear regression model thanks to the APCA+ (extended ANOVA-simultaneous component analysis) framework developed by Thiel et al. 2017. Briefly, APCA+ explores the reconstructed data that is captured by each variable separately in the presence of the unmodelled data. The advantage of this framework is that it is generic and works for any linear model. This approach is also well suited for single-cell applications, as it enables the visualization and exploration of the effects of a known variable alongside the unmodelled information that contains cellular heterogeneity.
(caRes <- scpComponentAnalysis(
    leduc_minimal, ncomp = 20, method = "APCA", effect = "SampleType"
))
The results are contained in a list with 2 elements. bySample contains the PC scores, that is the component results in cell space. byFeature contains the eigenvectors, that is the component results in peptide space. Each of the two elements contains component results for the data before modelling (unmodelled), for the residuals, and for the APCA on the sample type variable (APCA_SampleType).
(caResCells <- caRes$bySample)
caResCells[[1]]
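The APCA+ idea described above, a PCA on the effect of one variable added back to the residuals, can be sketched with base R on simulated data. This is a simplified illustration without missing values, using `lm()` and `prcomp()`; it is not the algorithm used by scpComponentAnalysis(), and all object names are made up.

```r
set.seed(2023)
batch <- factor(rep(c("b1", "b2"), each = 10))
group <- factor(rep(c("A", "B"), times = 10))

## Simulated data: 50 peptides x 20 cells.
Y <- matrix(rnorm(50 * 20, sd = 0.5), nrow = 50)
Y[, batch == "b2"] <- Y[, batch == "b2"] + 2           ## batch effect
Y[1:25, group == "B"] <- Y[1:25, group == "B"] + 3     ## cell type effect

## One linear model per peptide (cells in rows, peptides in columns).
fit <- lm(t(Y) ~ batch + group)
X <- model.matrix(fit)
B <- coef(fit)

## Effect matrix of the variable of interest, plus the residuals.
effectGroup <- X[, "groupB", drop = FALSE] %*% B["groupB", , drop = FALSE]
apca <- effectGroup + residuals(fit)

## PCA on the reconstructed data; scores are in cell space.
pca <- prcomp(apca, center = TRUE, scale. = FALSE)
head(pca$x[, 1:2])
```

In this toy example the batch effect is excluded from the reconstructed matrix, so the first component is expected to separate the two simulated groups rather than the two batches.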
Let's explore the component analysis in cell space. Similarly to the previous explorations, we annotate the results.
leduc_minimal$cell <- colnames(leduc_minimal)
caResCells <- scpAnnotateResults(
    caResCells, colData(leduc_minimal), by = "cell"
)
We then generate the component plot. Providing the pointParams argument, we can shape the points by SampleType. To assess the impact of batch effects, we also colour the points according to the MS acquisition run.
scpComponentPlot(
    caResCells,
    pointParams = list(aes(shape = SampleType, colour = Set))
) |>
    wrap_plots(ncol = 1, guides = "collect")
While the data before modelling is mainly driven by batch effects, the APCA clearly separates the two cell populations. The plot can however only show 2 components at a time. We can explore more components using a subsequent dimension reduction, such as t-SNE. The scater package offers a comprehensive set of tools for dimension reduction on data contained in a SingleCellExperiment object and requires the components to be stored in the reducedDim slot. This is streamlined thanks to addReducedDims().
leduc_minimal <- addReducedDims(leduc_minimal, caResCells)
reducedDims(leduc_minimal)
We can now explore the SampleType effects for the 20 computed components through t-SNE.
library("scater")
leduc_minimal <- runTSNE(leduc_minimal, dimred = "APCA_SampleType")
plotTSNE(leduc_minimal, colour_by = "Set", shape_by = "SampleType") +
    ggtitle("t-SNE on 20 APCA components")
The two cell populations remain clearly separated with an excellent mixing of the acquisition runs, even when considering the 20 first APCA components.
We use the same approach to explore the component results in peptide space.
caResPeps <- caRes$byFeature
caResPeps <- scpAnnotateResults(
    caResPeps, rowData(leduc_minimal), by = "feature", by2 = "Sequence"
)
scpComponentPlot(
    caResPeps, pointParams = list(size = 0.8, alpha = 0.4)
) |>
    wrap_plots(ncol = 1)
This exploration may identify groups of covarying peptides, although no clear patterns appear in the example data set.
We can also combine the exploration of the components in cell and peptide space. This is possible thanks to biplots.
scpComponentBiplot(
    caResCells, caResPeps,
    pointParams = list(aes(colour = SampleType)),
    labelParams = list(size = 1.5, max.overlaps = 15),
    textBy = "gene", top = 10
) |>
    wrap_plots(ncol = 1, guides = "collect")
Finally, we offer functionality to aggregate the results at the protein level instead of the peptide level.
caResProts <- scpComponentAggregate(caResPeps, fcol = "gene")
caResProts$APCA_SampleType
Note that the aggregated tables in caResProts can be explored with the visualisation function scpComponentPlot().
Based on the estimated model, we generate batch-corrected data, that is data with only the effect of cell type and the residual data. We also remove the intercept.
(leduc_batchCorrect <- scpRemoveBatchEffect(
    leduc_minimal, effects = c("Set", "Channel", "MedianIntensity"),
    intercept = TRUE
))
Note that the batch-corrected data still contain missing values. The leduc_batchCorrect object can be used for downstream analysis.
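The principle behind model-based batch correction can be sketched on one simulated peptide with base R: subtract the estimated batch effect and intercept from the input, so that only the cell type effect and the residuals remain. This is an illustration under R's default treatment contrasts, where the intercept is the mean of the reference levels; scpRemoveBatchEffect() may parameterize the model differently.

```r
set.seed(99)
cells <- data.frame(
    Set = factor(rep(c("run1", "run2"), each = 10)),
    SampleType = factor(rep(c("Melanoma", "Monocyte"), times = 10))
)
cells$y <- 10 + 2 * (cells$Set == "run2") +
    1 * (cells$SampleType == "Monocyte") + rnorm(20, sd = 0.5)

fit <- lm(y ~ Set + SampleType, data = cells)
X <- model.matrix(fit)
beta <- coef(fit)

## Estimated contribution of the batch variable to each cell.
batchEffect <- X[, "Setrun2"] * beta[["Setrun2"]]
intercept <- beta[["(Intercept)"]]

## Corrected data: the cell type effect plus the residuals remain.
yCorrected <- cells$y - batchEffect - intercept

## Mean corrected intensity per run: no difference is left.
tapply(yCorrected, cells$Set, mean)
```

Because the toy design is balanced, the two runs end up with identical mean corrected intensities, while the difference between cell types is preserved.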
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "",
    crop = NULL
)
sessionInfo()
This vignette is distributed under a CC BY-SA license.