BiocStyle::markdown()
knitr::opts_chunk$set(dpi=300)
library("hdxstats")
library("dplyr")
library("ggplot2")
library("RColorBrewer")
library("tidyr")
library("pheatmap")
library("scales")
library("viridis")
library("patchwork")
library("Biostrings")
This vignette describes how to analyse time-resolved differential HDX-MS experiments. The key element is at least two conditions, e.g. apo + antibody, apo + small molecule, or protein closed + protein open. The experiment can be replicated, though if sufficient time points are analysed (>= 3), significant results can occasionally be obtained without replicates. The data provided should be centroid-centric data. This package does not yet support analysis straight from raw spectra. Typically the data will be provided as a .csv from tools such as DynamX or HDExaminer.
The package relies on Bioconductor infrastructure so that it integrates with
other data types and can benefit from advances in other fields of mass-spectrometry.
There are package-specific objects, classes and methods, but importantly there is
reuse of classes found in quantitative proteomics, mainly the QFeatures
object, which extends the SummarizedExperiment
class for mass spectrometry data.
The focus of this package is on testing and visualisation of the testing results.
We will begin with a structural variant experiment in which MBP and a structural variant were mixed in different proportions. HDX-MS was performed on these samples and we expect to see reproducible but subtle differences. We first load the data from the package; it is in .csv format.
MBPpath <- system.file("extdata", "MBP.csv", package = "hdxstats")
We can now read in the .csv file and have a quick look at the .csv.
MBP <- read.csv(MBPpath)
head(MBP) # have a look
length(unique(MBP$pep_sequence)) # number of unique peptide sequences
Let us have a quick visualisation of some of the data so that we can see some of the features.
filter(MBP, pep_sequence == unique(MBP$pep_sequence[1]), pep_charge == 2) %>%
    ggplot(aes(x = hx_time, y = d, group = factor(replicate_cnt),
               color = factor(hx_sample, unique(MBP$hx_sample)[c(7,5,1,2,3,4,6)]))) +
    theme_classic() +
    geom_point(size = 2) +
    scale_color_manual(values = brewer.pal(n = 7, name = "Set2")) +
    labs(color = "experiment", x = "Deuterium Exposure", y = "Deuterium incorporation")
We can see that the units of the time dimension are in seconds and that deuterium incorporation has been normalized into Daltons.
Working from a .csv is likely to cause issues downstream. Indeed, we run
the risk of accidentally changing the data or corrupting the file in some way.
Secondly, every .csv will be formatted slightly differently, so making extensible
tools for these files is inefficient. Furthermore, working with a generic
class used in other mass-spectrometry fields can speed up analysis and adoption
of new methods. We will work with the class QFeatures
from the QFeatures package,
as it is a powerful and scalable way to store quantitative mass-spectrometry data.
Firstly, the data is stored in long format rather than wide format. We first switch the data to wide format.
MBP_wide <- pivot_wider(data.frame(MBP),
                        values_from = d,
                        names_from = c("hx_time", "replicate_cnt", "hx_sample"),
                        id_cols = c("pep_sequence", "pep_charge"))
head(MBP_wide)
We notice that there are many columns that contain only NAs. The following code chunk removes
these columns.
MBP_wide <- MBP_wide[, colSums(is.na(MBP_wide)) != nrow(MBP_wide)]
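The column filter can be checked on a small made-up table before applying it to real data. The sketch below uses only base R; the data frame `toy` and its values are invented for illustration and are not part of the MBP data set.

```r
# Toy wide table: column "b" is entirely NA, column "c" is only partly NA
toy <- data.frame(a = c(1, 2, 3),
                  b = c(NA, NA, NA),
                  c = c(0.5, NA, 1.5))

# Number of missing values in each column
na_per_col <- colSums(is.na(toy))

# Keep a column unless every one of its entries is missing
toy_clean <- toy[, na_per_col != nrow(toy)]
colnames(toy_clean)
```

Columns that are only partly missing are kept, which is what we want here: a peptide may be absent in some conditions but present in others.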
We also note that the column names are not very informative. We are going to format them in a very specific way so that later functions can automatically infer the design from the column names. We use the format X(time)rep(replicate)cond(condition).
colnames(MBP_wide)[-c(1,2)]
new.colnames <- gsub("0_", "0rep", paste0("X", colnames(MBP_wide)[-c(1,2)]))
new.colnames <- gsub("_", "cond", new.colnames)
# remove annoying % signs
new.colnames <- gsub("%", "", new.colnames)
# remove space (NULL could get confusing later and WT is clear)
new.colnames <- gsub(" .*", "", new.colnames)
new.colnames
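To see what each substitution does, the same steps can be run on a few example names. This is a sketch in base R; the names in `raw` are hypothetical and only mimic the time_replicate_condition pattern. Note that the first substitution assumes every time point ends in 0 and that no replicate number does.

```r
# Hypothetical column names in the form time_replicate_condition
raw <- c("0_1_10%", "30_2_WT Null", "240_1_5%")

formatted <- gsub("0_", "0rep", paste0("X", raw))  # separator between time and replicate
formatted <- gsub("_", "cond", formatted)          # separator between replicate and condition
formatted <- gsub("%", "", formatted)              # drop % signs
formatted <- gsub(" .*", "", formatted)            # drop everything from the first space on
formatted
```

If your own time points or replicate labels do not follow these assumptions, it is safer to build the names directly from the separate columns with paste0 than to rely on pattern substitution.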
We will now parse the data into an object of class QFeatures.
We have provided
a function to assist with this in the package. If you want to do this yourself,
use the readQFeatures
function from the QFeatures package.
MBPqDF <- parseDeutData(object = DataFrame(MBP_wide), design = new.colnames, quantcol = 3:102)
To help us get used to the QFeatures object,
we show how to generate a heatmap
of these data from this object:
pheatmap(t(assay(MBPqDF)),
         cluster_rows = FALSE,
         cluster_cols = FALSE,
         color = brewer.pal(n = 9, name = "BuPu"),
         main = "Structural variant deuterium incorporation heatmap",
         fontsize = 14,
         legend_breaks = c(0, 2, 4, 6, 8, 10, 12, max(assay(MBPqDF))),
         legend_labels = c("0", "2", "4", "6", "8","10", "12", "Incorporation"))
If you prefer to have the start-to-end residue numbers in the heatmap instead, you can change the plot as follows:
regions <- unique(MBP[,c("pep_start", "pep_end")])
xannot <- paste0("[", regions[,1], ",", regions[,2], "]")
pheatmap(t(assay(MBPqDF)),
         cluster_rows = FALSE,
         cluster_cols = FALSE,
         color = brewer.pal(n = 9, name = "BuPu"),
         main = "Structural variant deuterium incorporation heatmap",
         fontsize = 14,
         legend_breaks = c(0, 2, 4, 6, 8, 10, 12, max(assay(MBPqDF))),
         legend_labels = c("0", "2", "4", "6", "8","10", "12", "Incorporation"),
         labels_col = xannot)
It may be useful to normalize HDX-MS data for either interpretation or visualization purposes. We can normalize by the number of exchangeable amides or by using back-exchange correction values. We first use percentage incorporation as normalisation and visualise it as a heatmap.
MBPqDF_norm1 <- normalisehdx(MBPqDF,
                             sequence = unique(MBP$pep_sequence),
                             method = "pc")
pheatmap(t(assay(MBPqDF_norm1)),
         cluster_rows = FALSE,
         cluster_cols = FALSE,
         color = brewer.pal(n = 9, name = "BuPu"),
         main = "Structural variant deuterium incorporation heatmap normalised",
         fontsize = 14,
         legend_breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1, 1.2),
         legend_labels = c("0", "0.2", "0.4", "0.6", "0.8","1", "Incorporation"),
         labels_col = xannot)
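To make the idea of normalising by exchangeable amides concrete, here is a rough base R sketch. The helper `max_exchangeable` is hypothetical and assumes one common convention (peptide length minus the number of prolines minus one for the N-terminus); the convention used inside hdxstats may differ, and the peptides and uptake values are made up.

```r
# Hypothetical helper (not part of hdxstats): maximum number of exchangeable
# backbone amides, assumed to be length - number of prolines - 1 (N-terminus).
max_exchangeable <- function(sequence) {
    n_proline <- lengths(regmatches(sequence, gregexpr("P", sequence, fixed = TRUE)))
    nchar(sequence) - n_proline - 1
}

# Made-up peptides and deuterium uptake values in Daltons
peptides <- c("AKIEEGKL", "VIWPNGDKGY")
uptake <- c(3.5, 4.0)

# Fractional incorporation: uptake relative to the theoretical maximum
fraction <- uptake / max_exchangeable(peptides)
fraction
```

Expressing uptake as a fraction of the theoretical maximum makes peptides of different lengths comparable on one colour scale.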
Now, we demonstrate a back-exchange correction calculation. The back-exchange values are fictitious, but the code chunk below demonstrates how to set this up.
# made-up correction factor
correction <- (exchangeableAmides(unique(MBP$pep_sequence)) + 1) * 0.9
MBPqDF_norm2 <- normalisehdx(MBPqDF,
                             sequence = unique(MBP$pep_sequence),
                             method = "bc",
                             correction = correction)
pheatmap(t(assay(MBPqDF_norm2)),
         cluster_rows = FALSE,
         cluster_cols = FALSE,
         color = brewer.pal(n = 9, name = "BuPu"),
         main = "Structural variant deuterium incorporation heatmap normalised",
         fontsize = 14,
         legend_breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1, 1.2),
         legend_labels = c("0", "0.2", "0.4", "0.6", "0.8","1", "Incorporation"),
         labels_col = xannot)
The hdxstats
package uses an empirical Bayes functional approach to analyse
the data. We explain this idea in steps so that we can get an idea of the approach.
First we fit the parametric model to the data. This will allow us to explore
the HdxStatModel class.
res <- differentialUptakeKinetics(object = MBPqDF[,1:100], # provide a QFeatures object
                                  feature = rownames(MBPqDF)[[1]][37], # which peptide do we fit
                                  start = list(a = NULL, b = 0.0001, d = NULL, p = 1)) # what are the starting parameter guesses
Here, we see the HdxStatModel
class, and that a Functional Model was applied
to the data and a total of 7 models were fitted.
The nullmodel
and alternative
slots of an instance of HdxStatModel
contain the underlying fitted models. The method
and formula
slots provide vital
information about what analysis was performed. The vis
slot provides a ggplot
object so that we can visualise the functional fits.
We can also explore functional models other than the Weibull, for example a model that is the sum of two exponential uptake curves.
res2 <- differentialUptakeKinetics(object = MBPqDF[,1:100], # provide a QFeatures object
                                   formula = value ~ a * (1 - exp(-b * timepoint)) + c * (1 - exp(-d * timepoint)),
                                   feature = rownames(MBPqDF)[[1]][37], # which peptide do we fit
                                   start = list(a = 0, b = 0.0001, c = 0, d = 0.0001)) # what are the starting parameter guesses
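To get a feel for the shape of this two-component curve, it can be evaluated directly in base R. The function `biexp` below simply restates the formula above; the parameter values and time points are made up for illustration and are not fitted values.

```r
# Two-component uptake curve, restating the formula passed to the model above
biexp <- function(timepoint, a, b, c, d) {
    a * (1 - exp(-b * timepoint)) + c * (1 - exp(-d * timepoint))
}

# Made-up parameters: a fast component (rate b) and a slow component (rate d)
timepoints <- c(0, 30, 240, 1800, 14400)
curve_vals <- biexp(timepoints, a = 2, b = 0.05, c = 3, d = 0.0005)
round(curve_vals, 3)
```

The curve starts at zero, increases monotonically, and approaches a + c at long exposure times, so a and c can be read as the amplitudes of the fast and slow exchanging populations.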
We can then visualize the output of our curve fitting.
One question is whether we can say statistically that one model is better than another. We can do this by computing the likelihood ratio between the two proposed models. We can see that the likelihoods from the second model are generally higher.
logLik(res)
logLik(res2)
Let us compute minus twice the difference of the log-likelihoods:
diff <- -2 * (logLik(res) - logLik(res2))
Wilks' theorem tells us that this statistic is asymptotically chi-squared distributed, with degrees of freedom equal to the difference in the degrees of freedom of the two models. Each model has 4 degrees of freedom, so the difference is zero and there is no need to apply any theoretical results in this case. The second model is preferred, as it has the higher log-likelihood.
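When two nested models do differ in their number of parameters, the same statistic can be compared against a chi-squared distribution. The following is a minimal base R sketch; `lrt_pvalue` is a hypothetical helper, not an hdxstats function, and the log-likelihoods are made-up numbers.

```r
# Hypothetical helper (not part of hdxstats): likelihood ratio test for two
# nested models, given their maximised log-likelihoods and parameter counts.
lrt_pvalue <- function(loglik_null, loglik_alt, df_null, df_alt) {
    df_diff <- df_alt - df_null
    if (df_diff <= 0) {
        stop("the alternative model must have more free parameters than the null model")
    }
    # minus twice the difference of the log-likelihoods
    statistic <- -2 * (loglik_null - loglik_alt)
    # upper tail of the chi-squared distribution with df_diff degrees of freedom
    p_value <- pchisq(statistic, df = df_diff, lower.tail = FALSE)
    list(statistic = statistic, df = df_diff, p.value = p_value)
}

# Made-up log-likelihoods, for illustration only
out <- lrt_pvalue(loglik_null = -105, loglik_alt = -100, df_null = 2, df_alt = 4)
out$statistic
out$p.value
```

The helper refuses to run when the parameter counts are equal, which mirrors the situation above: with 4 parameters in each model there is no chi-squared reference distribution to compare against.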