library("pRoloc") library("pRolocdata") require("reshape") require("ggplot2") require("coda") setStockcol(paste0(getStockcol(), 90))
This section explains how to manually manipulate the MCMC output of
the Bayesian model TAGM-MCMC applied to spatial proteomics data
[@Crook:2018]. First, we load the MCMC data (as produced by
tagmMcmcTrain
) to be used and the packages required for analysis.
The tanTagm.rda
file is nearly 400MB large and isn't direcly
distributed in a package but can be downloaded from the
http://bit.ly/tagm-mcmc google drive to
reproduce the following analyses. Alternatively, to avoid manual
intervention, the following code chunk downloads the file from an
alternative server:
destdir <- tempdir() destfile <- file.path(destdir, "tanTagm.rda") download.file("http://proteome.sysbiol.cam.ac.uk/lgatto/files/tanTagm.rda", destfile) load(destfile)
## Load tanTagm data containing the MCMC analysis, which is assumed to ## be available in the working directory. load("tanTagm.rda") tanTagm
We now load the example data for which we performed this Bayesian analysis. This Drosophila embryo spatial proteomics experiment is from [@Tan2009].
data(tan2009r1) ## get data from pRolocdata
The tanTagm
data was produce by executing the tagmMcmcTrain
function. $20000$ iterations were performed, automatically discarding
$5000$ iterations for burnin, sub-sampling the chains by $10$ and
running a total of $4$ chains in parallel. This results in $4$ chains
each with $1500$ MCMC iterations.
tanTagm <- tagmMcmcTrain(object = tan2009r1, numIter = 20000, burnin = 5000, thin = 10, numChains = 4)
Using parallel chains for analysis allows us to diagnose whether our MCMC algorithm has converged (or not). The number of parallel MCMC chains used in this analysis was 4.
## Get number of chains nChains <- length(tanTagm) nChains
The following code chunks sets up a manual convegence diagnostic
check. We make use of objects and methods in the package
r BiocStyle::CRANpkg("coda")
to peform this analysis [@coda]. We
calculate the total number of outliers at each iteration of each chain
and if the algorithm has converged this number should be the same (or
very similar) across all 4 chains. We can observe this by sight by
producing trace plots for each MCMC chain.
## Convergence diagnostic to see if more we need to discard any ## iterations or entire chains: compute the number of outliers for ## each iteration for each chain out <- mcmc_get_outliers(tanTagm) ## Using coda S3 objects to produce trace plots and histograms plot(out[[1]], col = "blue", main = "Chain 1") plot(out[[2]], col = "red", main = "Chain 2") plot(out[[3]], col = "green", main = "Chain 3") plot(out[[4]], col = "orange", main = "Chain 4")
We can use the r BiocStyle::CRANpkg("coda")
package to produce
summaries of our chains. Here is the coda
summary for the first
chain.
## all chains average around 360 outliers summary(out[[1]])
In this case our chains looks very good. We can see that MCMC algorithm allocated, on average, 360 proteins to the outlier component. The iteration are rapidly oscillating around this value and there is no observed monotonicity in our output. This suggest that our chains have probably converged. The value 360 is not important, different datasets will allocate different numbers of proteins to the outlier component. Do not be surprised if large proportion of your proteins are allocated to the outlier component. Furthermore, as the chain is further processed in downstream analysis, this value might change. Should you find that all proteins are allocated to the outlier component, then check that there are no processing errors and the data is of good quality. If there are still issues consider change the default priors, however changing prior is unlikely to save low resolution data.
However, for a more rigorous and unbiased analysis of convergence we can calculate the Gelman diagnostics using the coda package [@Gelman:1992,@Brools:1998]. This statistics is often refered to as $\hat{R}$ or the potential scale reduction factor. The idea of the Gelman diagnostics is to so compare the inter and intra chain variance. The ratio of these quantities should be close to one. $\hat{R}$ values less than $1.1$ are likely to suggest convergence, though stable results are found if $\hat(R)$ is less than $1.2$. See the link for an extended discussion and justification http://www.stat.columbia.edu/~gelman/research/published/brooksgelman2.pdf. The actual statistics computed is more complicated, but we do not go deeper here and a more detailed and in depth discussion can be found in the references. The coda package also reports the $95\%$ upper confidence interval of the $\hat{R}$ statistic. If that statistic is to0 high, carefully inspect your chains. If it appears that a single chain is behaving poorly then you should discard it from downstream analysis and reassess convergence - how to do this is explain below. Examples of poor behaviour include (1) not oscillating around the same value, (2) monotonicity in the chain, (3) catapillar or wave like behaviour. Including, poor quality MCMC chains in downstream analysis will give poor results. If a chain stabilises after a particular value, then consider discarding iteration up to that point - this can be done using \code{mcmc_burn_chains} function, which is detailed below. Some chains cannot be rescued and the MCMC algorithm must be run for more iterations. In this case, restart your analysis and increase the value of \code{numIter}. If your chains look high quality at this point you can skip to processing your results section.
## Can check gelman diagnostic for convergence (values less than <1.05 ## are good for convergence) gelman.diag(out) ## the Upper C.I. is 1 so mcmc has clearly converged
We can also look at the Gelman diagnostics statistics for pairs of chains.
## We can also check individual pairs of chains for convergence gelman.diag(out[1:2]) # the upper C.I is 1.01 gelman.diag(out[c(1,3)]) # the upper C.I is 1 gelman.diag(out[3:4]) # the upper C.I is 1
This section explains how to manually manipulate our MCMC chains.
Here we demonstrate how to discard chains that may not have converged. Including chains that have not converged in downstream analysis is likely to lead to nonsensical results. As an example, we demonstrate how to remove the second chain from our domwnstream analysis.
## Our chains look really good but let us discard chain 2, as an ## example in case we didn't believe it had converged. It would be ## possible to remove more than one chain e.g. to remove 2 and 4 using ## c(2, 4). newTanMcmc <- tanTagm[-2] ## Let check that it looks good newTanMcmc length(newTanMcmc) == (nChains - 1) ## Let have a look at our first chain tanChain1 <- pRoloc:::chains(newTanMcmc)[[1]] tanChain1@n ## Chain has 1500 iterations.
We could repeat the convergence diagnostics of the previous section to check for convergence again.
We are happy that our chains have conveged. Let us recap where our analysis up until this point. We have 3 chains each with 1500 iterations. We now demonstrate how to discard iterations from individual chains, since they may have converged some number of iteration into the analysis. Let us use only the iterations of last half of the remaining three chains. This also speeds up computations, however too few iterations in the further analysis is likely to lead to poor results. We find that using at least $1000$ iterations for downstream analysis leads to stable results.
n <- (tanChain1@n)/2 # Number of iterations to keep 750 tanTagm2 <- mcmc_burn_chains(newTanMcmc, n)
tanTagm2
is now an object of class MCMCParams
with 3 chains each
with each 750 iterations.
## Check tanTagmParams object pRoloc:::chains(tanTagm2)[[1]] pRoloc:::chains(tanTagm2)[[2]] pRoloc:::chains(tanTagm2)[[3]]
The summary slot of the tanTagm2
is currently empty, we can now
populate the summary slot of tanTagm2
using the tagmMcmcProcess
function.
## This will automatically pool chains to produce summary (easy to ## create single summaries by subsetting) tanTagm2 <- tagmMcmcProcess(tanTagm2) ## Let look at this object summary(tanTagm2@summary@posteriorEstimates)
For a sanity check, let us re-check the diagnostics. This is
re-computed when we excute the tagmMcmcProcess
function and located
in a diagnostics
slot.
## Recomputed diagnostics tanTagm2@summary@diagnostics
Let us look at a summary of the analysis.
summary(tanTagm2@summary@tagm.joint)
The pRoloc
function tagmPredict
can be used to append protein MCMC
results to the feature data of our object of class MSnSet
. This
creates new columns in the feature data of the MSnSet
, which can be
used for final analysis of our data.
## We can now use tagmPredict tan2009r1 <- tagmPredict(tan2009r1, params = tanTagm2)
Now that we have processed our chains, checked convergence and
summarised the results into our MSnset
; we can interrogate our data
for novel biological results. We use the plot2D
function to view the
probabilitic allocations of proteins to sub-cellular niches.
## Create prediction point size ptsze <- exp(fData(tan2009r1)$tagm.mcmc.probability) - 1 ## Create plot2D with pointer scaled with probability plot2D(tan2009r1, fcol = "tagm.mcmc.allocation", cex = ptsze, main = "protein pointer scaled with posterior localisation probability") addLegend(object = tan2009r1, where = "topleft", cex = 0.5)
By using the Shannon entropy we can globally visualise uncertainty. Proteins can be scaled with their Shannon entropy and we note that proteins with high Shannon entropy have high uncertainty.
## Visualise shannon entropy ## Create prediction point size ptsze2 <- 3 * fData(tan2009r1)$tagm.mcmc.mean.shannon plot2D(tan2009r1, fcol = "tagm.mcmc.allocation", cex = ptsze2, main = "protein pointer scaled with Shannon entropy") addLegend(object = tan2009r1, where = "topleft", cex = 0.5)
Our data can be interrogated in other ways. We might be interested in all the proteins that were confidently assigned as outliers. A GO enrichment analysis could be performed on these proteins, since they may reveal biologically interesting results.
## Get outlier lists proteins with probability greater than 0.95 of being outlier outliers <- rownames(tan2009r1)[fData(tan2009r1)$tagm.mcmc.outlier > 0.95] outliers
We might be interested in analysing the localisation of individual proteins and interrogating which other potential localisations these proteins might have. A violin plot of the probabilistic allocation of the following protein Q9VCK0 is visualised. This demonstrates and quantifies the uncertainty in the allocation of this protein.
plot(tanTagm, "Q9VCK0")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.