This document summarises thoughts about improving data analysis workflows for mass spectrometry-based single-cell proteomics (SCP). The effort is a response to our recent articles. In our overview of data processing workflows (Vanderaa and Gatto 2023), we highlighted the lack of consensus when processing data, leading to a "one lab, one workflow" situation. Although this might be a consequence of heterogeneous data structures, we showed that this is not the case and that it is rather due to the lack of a principled approach. This document provides future directions for scp and beyond to harmonise current practices.
Van der Watt et al. 2021 showed the importance of benchmarking spectrum identification tools. The same lab later reported that the properties of spectra from SCP experiments do not fulfil the assumptions of identification tools designed for bulk proteomics samples (Boekweg et al. 2021).
There is definitely room for improvement; here are some thoughts about SCP spectrum identification:
Low-quality features (PSMs) and low-quality samples (single cells) may introduce noise (e.g. low signal due to sample preparation issues) or bias (e.g. a wrongly matched PSM).
It seems sensible to identify and remove low-quality features and samples before they propagate through later processing steps. Data quality control should therefore be the very first step; I find it surprising that many feature or sample quality controls are performed at a later stage. Of all the surveyed QC metrics, none requires upstream processing, so all can be applied immediately.
The most difficult part of sample preparation is getting rid of background signal/contamination such as FACS carry-over (cf. discussions with Harrison Specht and Andrew Leduc). Negative controls may be used to model systematic background contamination. This modelling should take into account that the background changes over time.
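As a minimal sketch of what such a model could look like, assuming a hypothetical table qc with one row per negative control acquisition, its acquisition order (runOrder, a proxy for time) and its median log2 intensity (medianLog2):
## Hypothetical input: one negative control acquisition per row,
## with a simulated background that drifts over time
qc <- data.frame(runOrder = 1:50,
                 medianLog2 = rnorm(50, mean = 8) + 0.02 * (1:50))
## Model the background as a smooth function of acquisition time
bgFit <- loess(medianLog2 ~ runOrder, data = qc)
## Predicted background level for each run, to be subtracted from the
## single-cell channels acquired at the same time
qc$background <- predict(bgFit)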
Protein coverage checks remain barely explored in SCP, with Orsburn et al. 2022 as a notable exception.
Identify failed cells and failed runs based on simple QC metrics, such as the median intensity or the number of detected features.
A reference channel in single-cell experiments contains about 5 cell equivalents and is generated by dilution of bulk lysates. This sample is then used to subtract from or divide the peptide intensities of the samples in the same batch. Therefore, we should call this batch correction rather than normalisation. I dislike this approach since noise is associated with each reference acquisition. Olga Vitek suggested including at least 2 reference channels to better estimate the "normalisation" factor. We also refrained from "normalising" on the linear scale (division by the reference) and rather performed it on the log scale (subtraction).
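As a sketch, the log-scale correction amounts to subtracting the (averaged) reference channel(s); x and refCols below are hypothetical:
## x: log2 peptide x channel matrix for a single run (simulated here)
x <- matrix(rnorm(100 * 16, mean = 10), nrow = 100)
## Indices of the reference channels; using 2 or more, as suggested
## above, stabilises the estimate
refCols <- 1:2
## Log-scale "normalisation": subtract the averaged reference channels
xCorrected <- x[, -refCols] - rowMeans(x[, refCols])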
This question only applies to multiplexed experiments: since we perform batch correction for MS acquisition runs and for multiplexing labels (TMT, mTRAQ), is normalisation still required?
I answered this question with scplainer: yes, normalisation is very important and captures cell-specific variation that is not captured by the combined modelling of label and MS batch. scplainer currently uses the median peptide abundance as a normalisation variable during modelling. In this case, normalisation captures both technical variation, such as small differences in sample preparation, and biological variation, for instance differences in cell size or metabolic activity. Whether the latter should be included in or excluded from downstream analyses is a conceptual consideration. My intuition is as follows:
If these two types of variation indeed matter, the model should be able to decouple cell-specific technical variation from cell-specific biological variation. To do so, we could model the normalisation factor as a function of biological variables such as cell size. For instance, FACS experiments measure SSC and FSC, which provide proxies for granularity and size, respectively. Alternatively, the CellenOne measures cell sizes during dispensing. I however noticed a poor correlation between FSC/SSC and the median intensity in the schoof2021 dataset. One explanation is that the factors that drive the need for normalisation are more technical than biological and occur after cell dispensing. Leduc et al. 2022 observed a good correlation between total protein intensities and cell sizes, as measured by the CellenOne. I could confirm this finding with our modelling approach using minimal data processing. However, the relationship is not perfect, indicating that technical variability also influences the intensity. Modelling the normalisation factor as a function of cell size would enable an effective decoupling of biological and technical effects: the estimated trend would represent biological normalisation, while the residuals would represent technical normalisation, each being included in the scplainer modelling approach.
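A minimal sketch of this decoupling, assuming hypothetical per-cell annotations: the median log2 intensity (medianLog2) as the normalisation factor and the CellenOne-measured diameter (diameter) as the cell size:
## Hypothetical per-cell annotations (simulated)
cells <- data.frame(medianLog2 = rnorm(200, mean = 10),
                    diameter   = rnorm(200, mean = 15, sd = 2))
## Regress the normalisation factor on cell size
fit <- lm(medianLog2 ~ diameter, data = cells)
## The fitted trend captures the biological component (cell size)...
cells$bioNorm <- fitted(fit)
## ... while the residuals capture the technical component
cells$techNorm <- residuals(fit)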
The common practice to estimate sample normalisation factors is to compute the median, mean or total intensity for each cell. I would choose the median over the mean as it is more robust against outliers, but is this the best solution?
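Both estimators are one-liners on a (simulated) log2 peptide-by-cell matrix with missing values:
## Simulated log2 peptide x cell matrix with missing values
x <- matrix(rnorm(1000 * 20, mean = 10), nrow = 1000)
x[sample(length(x), 5000)] <- NA
## Per-cell normalisation factors
nfMedian <- apply(x, 2, median, na.rm = TRUE) ## robust to outlying peptides
nfMean   <- colMeans(x, na.rm = TRUE)         ## sensitive to outliers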
Another approach is to align quantiles. While this makes sense for bulk samples, I am not convinced that different cells should share the same feature intensity distribution, especially in the presence of high proportions of missing values.
Another approach would be to draw inspiration from the deconvolution approach implemented in the scran package by Lun et al. 2016, but this would require some effort to assess its suitability and to adapt the algorithm.
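As a starting point for such an exploration, and under the strong assumptions that intensities can be treated like counts and missing values like zeros, the deconvolution size factors could be computed as follows:
library(scran)
## Simulated peptide x cell "counts"; treating intensities as counts
## and missing values as zeros are strong assumptions
x <- matrix(rpois(200 * 100, lambda = 20), nrow = 200)
sce <- SingleCellExperiment(assays = list(counts = x))
## Deconvolution (pooling) size factors, as in Lun et al. 2016
sce <- computeSumFactors(sce)
summary(sizeFactors(sce))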
Replacing values with batch-corrected values does not propagate estimation uncertainty.
It was clear from the Jupyter notebooks of Schoof et al. 2021 that the TMT label has an impact on quantification. We further confirmed, for most multiplexed experiments, that the label has an impact on quantification, but that this impact can be modelled.
SCP data seem not to be MNAR, but I could not find a convincing framework to demonstrate this. Instead, I rely on the relationship between average feature intensity and feature detection rate, as suggested by Li and Smyth 2023. However, their approach does not account for the presence of batch effects and may be biased for SCP.
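The diagnostic itself is straightforward to compute on a log2 peptide-by-cell matrix with missing values (simulated below):
## Simulated log2 peptide x cell matrix with ~50 % missing values
x <- matrix(rnorm(1000 * 50, mean = 10), nrow = 1000)
x[runif(length(x)) < 0.5] <- NA
## Per-feature detection rate and average observed intensity
detRate <- rowMeans(!is.na(x))
avgInt  <- rowMeans(x, na.rm = TRUE)
## Under MNAR, low-abundance features should show markedly lower
## detection rates
plot(avgInt, detRate,
     xlab = "Average log2 intensity (observed values)",
     ylab = "Detection rate")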
I think SCP data are therefore mostly MCAR. However, I find the MCAR definition bears little information: it says that values are missing for reasons unrelated to the underlying value, but there must be other reasons for values to be missing, for which we could model the mechanism. For instance, in DDA, we could model the detection rate of a peptide with respect to the MS1 rank of its precursor in each run: is a peptide not detected because it was not selected for MS2? For DIA, we could adapt the approach and model the detection rate with the expected spectral complexity determined in MS1. Other variables could be peptide properties: could we predict peptide ionisation based on its physico-chemical properties?
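Such a mechanism could be probed with a logistic regression; the table below is simulated, with detected a detection indicator and ms1Rank the MS1 intensity rank of the precursor in its run (hypothetical variable names):
## Simulated PSM-level table: detection probability decreases with
## the MS1 rank of the precursor
psm <- data.frame(ms1Rank = sample(500, 1000, replace = TRUE))
psm$detected <- rbinom(1000, 1, plogis(2 - 0.01 * psm$ms1Rank))
## Model the detection probability as a function of MS1 precursor rank
fit <- glm(detected ~ ms1Rank, data = psm, family = binomial())
summary(fit)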
We discussed this in Vanderaa and Gatto 2023.
Some alternative approaches that serve as inspiration:
library(scpdata)
library(dplyr)
library(ggplot2)
## Load the Specht et al. 2019 (SCoPE2) dataset
specht <- specht2019v3()
## PSM-level feature annotations from the first assay
featAnn <- rowData(specht[[1]])
## Build a precursor identifier from run, sequence, charge and modifications
featAnn$ID <- paste0(featAnn$Set, featAnn$Sequence, featAnn$Charge,
                     featAnn$Modifications)
## Identifiers matched by several PSMs
dupl <- featAnn$ID[duplicated(featAnn$ID)]
## Plot the DART-ID q-values along retention time for duplicated
## precursors in the first run
featAnn %>%
    data.frame() %>%
    filter(Set == Set[[1]],
           ID %in% dupl) %>%
    ggplot() +
    aes(y = dart_qval,
        x = Retention.time,
        colour = ID,
        group = ID) +
    geom_point() +
    geom_line() +
    theme(legend.position = "none")
Perform protein-level inference using peptide-level data, as performed by the MSqRob model.
For most datasets, linear regression is sufficient and we can retrieve unknown sources of variation from the residuals. This two-step approach could be formalised through the optimisation of latent variables, as performed in RUV-III-C (Poulos et al. 2020) or ZINB-WaVE (Risso et al. 2018).
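A sketch of the two-step approach, assuming a complete log2 peptide-by-cell matrix x and a table coldata of known covariates (hypothetical names):
## Simulated data: 300 peptides x 100 cells, with known covariates
x <- matrix(rnorm(300 * 100, mean = 10), nrow = 300)
coldata <- data.frame(Set     = factor(rep(1:5, each = 20)),
                      Channel = factor(rep(1:4, times = 25)))
## Step 1: regress out the known covariates, feature by feature
design <- model.matrix(~ Set + Channel, data = coldata)
resids <- t(apply(x, 1, function(y) lm.fit(design, y)$residuals))
## Step 2: look for unknown structure in the residuals, e.g. by PCA;
## leading components could re-enter the model as latent covariates
pca <- prcomp(t(resids))
head(pca$x[, 1:2])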