library("pRoloc")
library("pRolocdata")
require("reshape")
require("ggplot2")
require("coda")

setStockcol(paste0(getStockcol(), 90))

TAGM-MCMC, in details

This section explains how to manually manipulate the MCMC output of the Bayesian model TAGM-MCMC applied to spatial proteomics data [@Crook:2018]. First, we load the MCMC data (as produced by tagmMcmcTrain) to be used and the packages required for analysis.

The tanTagm.rda file is nearly 400MB large and isn't direcly distributed in a package but can be downloaded from the http://bit.ly/tagm-mcmc google drive to reproduce the following analyses. Alternatively, to avoid manual intervention, the following code chunk downloads the file from an alternative server:

destdir <- tempdir()
destfile <- file.path(destdir, "tanTagm.rda")
download.file("http://proteome.sysbiol.cam.ac.uk/lgatto/files/tanTagm.rda",
              destfile)
load(destfile)
## Load tanTagm data containing the MCMC analysis, which is assumed to
## be available in the working directory.
load("tanTagm.rda")
tanTagm

We now load the example data for which we performed this Bayesian analysis. This Drosophila embryo spatial proteomics experiment is from [@Tan2009].

data(tan2009r1) ## get data from pRolocdata

The tanTagm data was produce by executing the tagmMcmcTrain function. $20000$ iterations were performed, automatically discarding $5000$ iterations for burnin, sub-sampling the chains by $10$ and running a total of $4$ chains in parallel. This results in $4$ chains each with $1500$ MCMC iterations.

tanTagm <- tagmMcmcTrain(object = tan2009r1,
                         numIter = 20000,
                         burnin = 5000,
                         thin = 10,
                         numChains = 4)

Data exploration and convergence diagnostics

Using parallel chains for analysis allows us to diagnose whether our MCMC algorithm has converged (or not). The number of parallel MCMC chains used in this analysis was 4.

## Get number of chains
nChains <- length(tanTagm)
nChains

The following code chunks sets up a manual convegence diagnostic check. We make use of objects and methods in the package r BiocStyle::CRANpkg("coda") to peform this analysis [@coda]. We calculate the total number of outliers at each iteration of each chain and if the algorithm has converged this number should be the same (or very similar) across all 4 chains. We can observe this by sight by producing trace plots for each MCMC chain.

## Convergence diagnostic to see if more we need to discard any
## iterations or entire chains: compute the number of outliers for
## each iteration for each chain
out <- mcmc_get_outliers(tanTagm)

## Using coda S3 objects to produce trace plots and histograms
plot(out[[1]], col = "blue", main = "Chain 1")
plot(out[[2]], col = "red", main = "Chain 2")
plot(out[[3]], col = "green", main = "Chain 3")
plot(out[[4]], col = "orange", main = "Chain 4")

We can use the r BiocStyle::CRANpkg("coda") package to produce summaries of our chains. Here is the coda summary for the first chain.

## all chains average around 360 outliers
summary(out[[1]])

In this case our chains looks very good. We can see that MCMC algorithm allocated, on average, 360 proteins to the outlier component. The iteration are rapidly oscillating around this value and there is no observed monotonicity in our output. This suggest that our chains have probably converged. The value 360 is not important, different datasets will allocate different numbers of proteins to the outlier component. Do not be surprised if large proportion of your proteins are allocated to the outlier component. Furthermore, as the chain is further processed in downstream analysis, this value might change. Should you find that all proteins are allocated to the outlier component, then check that there are no processing errors and the data is of good quality. If there are still issues consider change the default priors, however changing prior is unlikely to save low resolution data.

However, for a more rigorous and unbiased analysis of convergence we can calculate the Gelman diagnostics using the coda package [@Gelman:1992,@Brools:1998]. This statistics is often refered to as $\hat{R}$ or the potential scale reduction factor. The idea of the Gelman diagnostics is to so compare the inter and intra chain variance. The ratio of these quantities should be close to one. $\hat{R}$ values less than $1.1$ are likely to suggest convergence, though stable results are found if $\hat(R)$ is less than $1.2$. See the link for an extended discussion and justification http://www.stat.columbia.edu/~gelman/research/published/brooksgelman2.pdf. The actual statistics computed is more complicated, but we do not go deeper here and a more detailed and in depth discussion can be found in the references. The coda package also reports the $95\%$ upper confidence interval of the $\hat{R}$ statistic. If that statistic is to0 high, carefully inspect your chains. If it appears that a single chain is behaving poorly then you should discard it from downstream analysis and reassess convergence - how to do this is explain below. Examples of poor behaviour include (1) not oscillating around the same value, (2) monotonicity in the chain, (3) catapillar or wave like behaviour. Including, poor quality MCMC chains in downstream analysis will give poor results. If a chain stabilises after a particular value, then consider discarding iteration up to that point - this can be done using \code{mcmc_burn_chains} function, which is detailed below. Some chains cannot be rescued and the MCMC algorithm must be run for more iterations. In this case, restart your analysis and increase the value of \code{numIter}. If your chains look high quality at this point you can skip to processing your results section.

## Can check gelman diagnostic for convergence (values less than <1.05
## are good for convergence)
gelman.diag(out) ## the Upper C.I. is 1 so mcmc has clearly converged

We can also look at the Gelman diagnostics statistics for pairs of chains.

## We can also check individual pairs of chains for convergence
gelman.diag(out[1:2]) # the upper C.I is 1.01
gelman.diag(out[c(1,3)]) # the upper C.I is 1
gelman.diag(out[3:4]) # the upper C.I is 1

Manually manipulating MCMC chains

This section explains how to manually manipulate our MCMC chains.

Discarding entire chains

Here we demonstrate how to discard chains that may not have converged. Including chains that have not converged in downstream analysis is likely to lead to nonsensical results. As an example, we demonstrate how to remove the second chain from our domwnstream analysis.

## Our chains look really good but let us discard chain 2, as an
## example in case we didn't believe it had converged.  It would be
## possible to remove more than one chain e.g. to remove 2 and 4 using
## c(2, 4).
newTanMcmc <- tanTagm[-2]

## Let check that it looks good
newTanMcmc
length(newTanMcmc) == (nChains - 1)

## Let have a look at our first chain
tanChain1 <- pRoloc:::chains(newTanMcmc)[[1]]
tanChain1@n ## Chain has 1500 iterations.

We could repeat the convergence diagnostics of the previous section to check for convergence again.

Discarding iterations from each chain

We are happy that our chains have conveged. Let us recap where our analysis up until this point. We have 3 chains each with 1500 iterations. We now demonstrate how to discard iterations from individual chains, since they may have converged some number of iteration into the analysis. Let us use only the iterations of last half of the remaining three chains. This also speeds up computations, however too few iterations in the further analysis is likely to lead to poor results. We find that using at least $1000$ iterations for downstream analysis leads to stable results.

n <- (tanChain1@n)/2 # Number of iterations to keep 750
tanTagm2 <- mcmc_burn_chains(newTanMcmc, n)

tanTagm2 is now an object of class MCMCParams with 3 chains each with each 750 iterations.

## Check tanTagmParams object
pRoloc:::chains(tanTagm2)[[1]]
pRoloc:::chains(tanTagm2)[[2]]
pRoloc:::chains(tanTagm2)[[3]]

Processing and summarising MCMC results

Populating the summary slot

The summary slot of the tanTagm2 is currently empty, we can now populate the summary slot of tanTagm2 using the tagmMcmcProcess function.

## This will automatically pool chains to produce summary (easy to
## create single summaries by subsetting)
tanTagm2 <- tagmMcmcProcess(tanTagm2)

## Let look at this object
summary(tanTagm2@summary@posteriorEstimates)

For a sanity check, let us re-check the diagnostics. This is re-computed when we excute the tagmMcmcProcess function and located in a diagnostics slot.

## Recomputed diagnostics
tanTagm2@summary@diagnostics

Let us look at a summary of the analysis.

summary(tanTagm2@summary@tagm.joint)

Appending results to an MSnSet

The pRoloc function tagmPredict can be used to append protein MCMC results to the feature data of our object of class MSnSet. This creates new columns in the feature data of the MSnSet, which can be used for final analysis of our data.

## We can now use tagmPredict
tan2009r1 <- tagmPredict(tan2009r1, params = tanTagm2)

Visualising MCMC results

Visualising prediction results

Now that we have processed our chains, checked convergence and summarised the results into our MSnset; we can interrogate our data for novel biological results. We use the plot2D function to view the probabilitic allocations of proteins to sub-cellular niches.

## Create prediction point size
ptsze <- exp(fData(tan2009r1)$tagm.mcmc.probability) - 1

## Create plot2D with pointer scaled with probability
plot2D(tan2009r1,
       fcol = "tagm.mcmc.allocation",
       cex = ptsze,
       main = "protein pointer scaled with posterior localisation probability")

addLegend(object = tan2009r1, where = "topleft", cex = 0.5)

Visualising allocation uncertainty

By using the Shannon entropy we can globally visualise uncertainty. Proteins can be scaled with their Shannon entropy and we note that proteins with high Shannon entropy have high uncertainty.

## Visualise shannon entropy
## Create prediction point size
ptsze2 <- 3 * fData(tan2009r1)$tagm.mcmc.mean.shannon
plot2D(tan2009r1, fcol = "tagm.mcmc.allocation", cex = ptsze2,
       main = "protein pointer scaled with Shannon entropy")
addLegend(object = tan2009r1, where = "topleft", cex = 0.5)

Extracting proteins of interest

Our data can be interrogated in other ways. We might be interested in all the proteins that were confidently assigned as outliers. A GO enrichment analysis could be performed on these proteins, since they may reveal biologically interesting results.

## Get outlier lists proteins with probability greater than 0.95 of being outlier
outliers <- rownames(tan2009r1)[fData(tan2009r1)$tagm.mcmc.outlier > 0.95]
outliers

Extracting information for individual proteins

We might be interested in analysing the localisation of individual proteins and interrogating which other potential localisations these proteins might have. A violin plot of the probabilistic allocation of the following protein Q9VCK0 is visualised. This demonstrates and quantifies the uncertainty in the allocation of this protein.

plot(tanTagm, "Q9VCK0")


lgatto/pRoloc documentation built on Oct. 23, 2024, 12:51 a.m.