This section explains how to manually manipulate the MCMC output of
the Bayesian model TAGM-MCMC applied to spatial proteomics data
[@Crook:2018]. First, we load the MCMC data (as produced by
tagmMcmcTrain
) to be used and the packages required for analysis.
The tanTagm.rda
file is nearly 400MB large and isn't direcly
distributed in a package but can be downloaded from the
http://bit.ly/tagm-mcmc google drive to
reproduce the following analyses. Alternatively, to avoid manual
intervention, the following code chunk downloads the file from an
alternative server:
destdir <- tempdir()
destfile <- file.path(destdir, "tanTagm.rda")
download.file("http://proteome.sysbiol.cam.ac.uk/lgatto/files/tanTagm.rda",
destfile)
load(destfile)
## Load tanTagm data containing the MCMC analysis, which is assumed to
## be available in the working directory.
load("tanTagm.rda")
tanTagm
## Object of class "MCMCParams"
## Method: TAGM.MCMC
## Number of chains: 4
We now load the example data for which we performed this Bayesian analysis. This Drosophila embryo spatial proteomics experiment is from [@Tan2009].
data(tan2009r1) ## get data from pRolocdata
The tanTagm
data was produce by executing the tagmMcmcTrain
function. $20000$ iterations were performed, automatically discarding
$5000$ iterations for burnin, sub-sampling the chains by $10$ and
running a total of $4$ chains in parallel. This results in $4$ chains
each with $1500$ MCMC iterations.
tanTagm <- tagmMcmcTrain(object = tan2009r1,
numIter = 20000,
burnin = 5000,
thin = 10,
numChains = 4)
Using parallel chains for analysis allows us to diagnose whether our MCMC algorithm has converged (or not). The number of parallel MCMC chains used in this analysis was 4.
## Get number of chains
nChains <- length(tanTagm)
nChains
## [1] 4
The following code chunks sets up a manual convegence diagnostic check. We make use of objects and methods in the package coda to peform this analysis [@coda]. We calculate the total number of outliers at each iteration of each chain and if the algorithm has converged this number should be the same (or very similar) across all 4 chains. We can observe this by sight by producing trace plots for each MCMC chain.
## Convergence diagnostic to see if more we need to discard any
## iterations or entire chains: compute the number of outliers for
## each iteration for each chain
out <- mcmc_get_outliers(tanTagm)
## Using coda S3 objects to produce trace plots and histograms
plot(out[[1]], col = "blue", main = "Chain 1")
plot(out[[2]], col = "red", main = "Chain 2")
plot(out[[3]], col = "green", main = "Chain 3")
plot(out[[4]], col = "orange", main = "Chain 4")
We can use the coda package to produce
summaries of our chains. Here is the coda
summary for the first
chain.
## all chains average around 360 outliers
summary(out[[1]])
##
## Iterations = 1:1500
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 1500
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## 357.301 13.864 0.358 0.358
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## 329 348 357 367 384
In this case our chains looks very good. They all oscillate around an average of 360 outliers. There is no observed monotonicity in our output. However, for a more rigorous and unbiased analysis of convergence we can calculate the Gelman diagnostics using the coda package [@Gelman:1992,@Brools:1998]. This statistics is often refered to as $\hat{R}$ or the potential scale reduction factor. The idea of the Gelman diagnostics is to so compare the inter and intra chain variance. The ratio of these quantities should be close to one. The actual statistics computed is more complicated, but we do not go deeper here and a more detailed and in depth discussion can be found in the references. The coda package also reports the $95\%$ upper confidence interval of the $\hat{R}$ statistic.
## Can check gelman diagnostic for convergence (values less than <1.05
## are good for convergence)
gelman.diag(out) ## the Upper C.I. is 1 so mcmc has clearly converged
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## [1,] 1 1
We can also look at the Gelman diagnostics statistics for pairs of chains.
## We can also check individual pairs of chains for convergence
gelman.diag(out[1:2]) # the upper C.I is 1.01
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## [1,] 1 1.01
gelman.diag(out[c(1,3)]) # the upper C.I is 1
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## [1,] 1 1
gelman.diag(out[3:4]) # the upper C.I is 1
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## [1,] 1 1
This section explains how to manually manipulate our MCMC chains.
Here we demonstrate how to discard chains that may not have converged. Including chains that have not converged in downstream analysis is likely to lead to nonsensical results. As an example, we demonstrate how to remove the second chain from our domwnstream analysis.
## Our chains look really good but let us discard chain 2, as an
## example in case we didn't believe it had converged. It would be
## possible to remove more than one chain e.g. to remove 2 and 4 using
## c(2, 4).
newTanMcmc <- tanTagm[-2]
## Let check that it looks good
newTanMcmc
## Object of class "MCMCParams"
## Method: TAGM.MCMC
## Number of chains: 3
length(newTanMcmc) == (nChains - 1)
## [1] TRUE
## Let have a look at our first chain
tanChain1 <- pRoloc:::chains(newTanMcmc)[[1]]
tanChain1@n ## Chain has 1500 iterations.
## [1] 1500
We could repeat the convergence diagnostics of the previous section to check for convergence again.
We are happy that our chains have conveged. Let us recap where our analysis up until this point. We have 3 chains each with 1500 iterations. We now demonstrate how to discard iterations from individual chains, since they may have converged some number of iteration into the analysis. Let us use only the iterations of last half of the remaining three chains. This also speeds up computations, however too few iterations in the further analysis is likely to lead to poor results. We find that using at least $1000$ iterations for downstream analysis leads to stable results.
n <- (tanChain1@n)/2 # Number of iterations to keep 750
tanTagm2 <- mcmc_burn_chains(newTanMcmc, n)
tanTagm2
is now an object of class MCMCParams
with 3 chains each
with each 750 iterations.
## Check tanTagmParams object
pRoloc:::chains(tanTagm2)[[1]]
## Object of class "MCMCChain"
## Number of components: 11
## Number of proteins: 677
## Number of iterations: 750
pRoloc:::chains(tanTagm2)[[2]]
## Object of class "MCMCChain"
## Number of components: 11
## Number of proteins: 677
## Number of iterations: 750
pRoloc:::chains(tanTagm2)[[3]]
## Object of class "MCMCChain"
## Number of components: 11
## Number of proteins: 677
## Number of iterations: 750
The summary slot of the tanTagm2
is currently empty, we can now
populate the summary slot of tanTagm2
using the tagmMcmcProcess
function.
## This will automatically pool chains to produce summary (easy to
## create single summaries by subsetting)
tanTagm2 <- tagmMcmcProcess(tanTagm2)
## Let look at this object
summary(tanTagm2@summary@posteriorEstimates)
## tagm.allocation tagm.probability tagm.probability.notOutlier
## ER :196 Min. :0.3055 Min. :0.0002031
## PM :192 1st Qu.:0.8439 1st Qu.:0.1712611
## Ribosome 40S : 75 Median :0.9810 Median :0.6001312
## mitochondrion: 61 Mean :0.8927 Mean :0.5278532
## Proteasome : 49 3rd Qu.:0.9980 3rd Qu.:0.8603698
## Ribosome 60S : 32 Max. :1.0000 Max. :0.9982244
## (Other) : 72
## tagm.probability.Outlier tagm.probability.lowerquantile
## Min. :0.001776 Min. :0.001053
## 1st Qu.:0.139630 1st Qu.:0.591425
## Median :0.399869 Median :0.942714
## Mean :0.472147 Mean :0.769211
## 3rd Qu.:0.828739 3rd Qu.:0.995483
## Max. :0.999797 Max. :0.999988
##
## tagm.probability.upperquantile tagm.mean.shannon
## Min. :0.4961 Min. :0.0000254
## 1st Qu.:0.9698 1st Qu.:0.0141329
## Median :0.9959 Median :0.0911157
## Mean :0.9655 Mean :0.2386502
## 3rd Qu.:0.9994 3rd Qu.:0.4050057
## Max. :1.0000 Max. :1.3051988
##
For a sanity check, let us re-check the diagnostics. This is
re-computed when we excute the tagmMcmcProcess
function and located
in a diagnostics
slot.
## Recomputed diagnostics
tanTagm2@summary@diagnostics
## Point est. Upper C.I.
## outlierTotal 1.00021 1.000274
Let us look at a summary of the analysis.
summary(tanTagm2@summary@tagm.joint)
## Cytoskeleton ER Golgi
## Min. :0.0000000 Min. :0.0000 Min. :0.0000000
## 1st Qu.:0.0000000 1st Qu.:0.0000 1st Qu.:0.0000000
## Median :0.0000040 Median :0.0000 Median :0.0000002
## Mean :0.0227762 Mean :0.2826 Mean :0.0503043
## 3rd Qu.:0.0004646 3rd Qu.:0.9232 3rd Qu.:0.0046922
## Max. :0.7798512 Max. :0.9993 Max. :0.9993792
## Lysosome mitochondrion Nucleus
## Min. :0.0000000 Min. :0.0000000 Min. :0.0000000
## 1st Qu.:0.0000001 1st Qu.:0.0000000 1st Qu.:0.0000000
## Median :0.0000016 Median :0.0000000 Median :0.0000000
## Mean :0.0167424 Mean :0.0916050 Mean :0.0365126
## 3rd Qu.:0.0000793 3rd Qu.:0.0000158 3rd Qu.:0.0000059
## Max. :0.9795195 Max. :0.9998841 Max. :0.9999886
## Peroxisome PM Proteasome
## Min. :0.0000000 Min. :0.000000 Min. :0.0000000
## 1st Qu.:0.0000000 1st Qu.:0.000000 1st Qu.:0.0000000
## Median :0.0000002 Median :0.000003 Median :0.0000002
## Mean :0.0055262 Mean :0.278256 Mean :0.0726235
## 3rd Qu.:0.0000232 3rd Qu.:0.764944 3rd Qu.:0.0005521
## Max. :0.8831085 Max. :0.999998 Max. :0.9979738
## Ribosome 40S Ribosome 60S
## Min. :0.000000 Min. :0.0000000
## 1st Qu.:0.000000 1st Qu.:0.0000000
## Median :0.000000 Median :0.0000001
## Mean :0.096071 Mean :0.0470258
## 3rd Qu.:0.001051 3rd Qu.:0.0001129
## Max. :0.996639 Max. :0.9885391
The pRoloc
function tagmPredict
can be used to append protein MCMC
results to the feature data of our object of class MSnSet
. This
creates new columns in the feature data of the MSnSet
, which can be
used for final analysis of our data.
## We can now use tagmPredict
tan2009r1 <- tagmPredict(tan2009r1, params = tanTagm2)
Now that we have processed our chains, checked convergence and
summarised the results into our MSnset
; we can interrogate our data
for novel biological results. We use the plot2D
function to view the
probabilitic allocations of proteins to sub-cellular niches.
## Create prediction point size
ptsze <- exp(fData(tan2009r1)$tagm.mcmc.probability) - 1
## Create plot2D with pointer scaled with probability
plot2D(tan2009r1,
fcol = "tagm.mcmc.allocation",
cex = ptsze,
main = "protein pointer scaled with posterior localisation probability")
addLegend(object = tan2009r1, where = "topleft", cex = 0.5)
By using the Shannon entropy we can globally visualise uncertainty. Proteins can be scaled with their Shannon entropy and we note that proteins with high Shannon entropy have high uncertainty.
## Visualise shannon entropy
## Create prediction point size
ptsze2 <- 3 * fData(tan2009r1)$tagm.mcmc.mean.shannon
plot2D(tan2009r1, fcol = "tagm.mcmc.allocation", cex = ptsze2,
main = "protein pointer scaled with Shannon entropy")
addLegend(object = tan2009r1, where = "topleft", cex = 0.5)
Our data can be interrogated in other ways. We might be interested in all the proteins that were confidently assigned as outliers. A GO enrichment analysis could be performed on these proteins, since they may reveal biologically interesting results.
## Get outlier lists proteins with probability greater than 0.95 of being outlier
outliers <- rownames(tan2009r1)[fData(tan2009r1)$tagm.mcmc.outlier > 0.95]
outliers
## [1] "P04412" "Q7KJ73" "Q00174" "Q9Y105" "Q9VTZ5" "B7Z0C1"
## [7] "P46150" "Q9VN14" "Q9VU58" "Q9V498" "Q9V496" "P06607"
## [13] "Q7K0F7" "Q9V8R9" "Q9VT75" "A8JNJ6" "Q8SZ38" "Q960V7"
## [19] "Q7KA66" "P98163" "Q9VE24" "Q9VE79" "A8JV22" "O96824"
## [25] "P23226" "Q9VKF0" "Q9VZL6" "P11584" "Q9VDK9" "Q9V4A7"
## [31] "Q9W303" "Q8MLV1" "Q07152" "Q9VEE9" "Q9VXE5" "P13395"
## [37] "Q7JYX2" "P91938" "Q9U969" "Q7JZY1" "O18388" "P02844"
## [43] "NO_ID_7" "P27716" "Q9W478" "Q8IN49" "O18337" "A1Z6L9"
## [49] "Q86P66" "Q9VK04" "Q9GU68" "Q8INH5" "O18333" "P54351"
## [55] "P15215" "P33450" "Q24298" "Q9W404" "Q9VPQ2" "Q9VF20"
## [61] "Q8IPU3" "Q7JVX3" "Q9VV60" "Q9V427" "Q7JQR3" "P52295"
## [67] "A8JRB8" "Q9VTU4" "Q24007" "B8A403" "Q00963" "Q94887"
## [73] "Q4AB31" "Q95SY7" "O15943" "Q9VDV3" "B7Z0D3" "Q9VLT3"
## [79] "Q9VUQ7" "Q9VAG2" "A1Z6H7" "Q27415" "Q9VHL2" "NO_ID_16"
## [85] "Q94518" "Q7KN94" "P29742" "Q9VXB0" "B7Z036" "Q8IRH6"
## [91] "P17917" "M9PDB2" "Q9VVA0" "Q9VHP0"
We might be interested in analysing the localisation of individual proteins and interrogating which other potential localisations these proteins might have. A violin plot of the probabilistic allocation of the following protein Q9VCK0 is visualised. This demonstrates and quantifies the uncertainty in the allocation of this protein.
plot(tanTagm, "Q9VCK0")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.