knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
We would like to simulate a count table, where some rows have differential abundance between different phenotypic outcomes. The count table has entities in rows and samples in columns. The easy way to do it is to randomly select some rows and increase or decrease the counts of samples with certain phenotypic outcomes. However, this might cause issues in compositional data because the proportion of a row might change even though its abundance has not changed. To avoid such issues, instead of adding counts to specific rows, we swap abundances between groups of rows. The group is based on the hierarchical structure of the entities, and groups with different sizes represent different hierarchical level.
```r Different simulated signal patters. The branches that have abundance phenotypic outcome associated are colored, and the others are shrinked to save space. The change direction is shown as color: blue (increase) and orange (decrease). The size of the points represents the amount of change (e.g., fold change) "}
suppressPackageStartupMessages({ library(treeAGG) library(ggtree) library(ggplot2) library(cowplot) })
data("exTree")
leaf1 <- findOS(tree = exTree, ancestor = 52, only.leaf = TRUE) l1 <- length(leaf1) leaf2 <- findOS(tree = exTree, ancestor = 76, only.leaf = TRUE) l2 <- length(leaf2) leaf3 <- c(4, 3, 6, 14, 10, 9)
fig1 <- treePlot(tree = exTree, branch.length = "none", branch = c(52, 76), col.branch = c("blue", "orange"), col.other = "grey30", zoomNode = c(66, 69, 95, 85), zoomScale = 1/20 ) + geom_point2(aes(subset = (node %in% leaf1)), color = "blue", size = 2) + geom_point2(aes(subset = (node %in% leaf2)), color = "orange", size = 2) + coord_flip() + scale_x_reverse()
fig2 <- treePlot(tree = exTree, branch.length = "none", branch = c(52, 76), col.branch = c("blue", "orange"), col.other = "grey30", zoomNode = c(66, 69, 95, 85), zoomScale = 1/20) + geom_point2(aes(subset = (node %in% leaf1)), color = "blue", size = runif(l1)3) + geom_point2(aes(subset = (node %in% leaf2)), color = "orange", size = runif(l2)3) + coord_flip() + scale_x_reverse()
fig3 <- treePlot(tree = exTree, branch.length = "none", branch = c(leaf3, 76), col.branch = c(rep("blue", length(leaf3)), "orange"), col.other = "grey30", zoomNode = c(66, 69, 95, 85), zoomScale = 1/20) + geom_point2(aes(subset = (node %in% leaf3)), color = "blue", size = 2) + geom_point2(aes(subset = (node %in% leaf2)), color = "orange", size = 2)+ coord_flip() + scale_x_reverse()
plot_grid(fig1, fig2, fig3, labels = c("S1", "S2", "S3"), nrow = 1)
The counts of entities in a sample are assumed to follow a Dirichlet Multinomial distribution as has been suggested in several articles [@Zhao2015,@Tang2016a,@Tang2016b,@Wu2016]: $$\mathbf{x}=(x_1,\dots,x_K) \sim \text{dirmult}(n, \mathbf{\alpha})$$ \begin{equation} \Pr(\mathbf{x}\mid\boldsymbol{\alpha})= \frac{\left(n!\right)\Gamma\left(\alpha_0\right)} {\Gamma\left(n+\alpha_0\right)}\prod_{k=1}^K \frac{\Gamma(x_{k}+\alpha_{k})}{\left(x_{k}!\right)\Gamma(\alpha_{k})} (\#eq:dirmult1) \end{equation} $\mathbf{x}=(x_1,\dots,x_K)$ are the counts of $K$ entities, $n$ is the library size of a sample, $\mathbf{\alpha} = (\alpha_1, \alpha_2, ..., \alpha_K)$ are the parameters for the distribution, and $\alpha_0 = \sum \alpha_k$. In the `r CRANpkg("dirmult")` package, the Dirichlet Multinomial distribution (equation \@ref(eq:dirmult1)) is reparameterized with $\mathbf{\pi} = (\pi_1, ..., \pi_K)$ and $\theta$ so that $\alpha_0 = \frac{(1-\theta)}{\theta}$ and $\alpha_k = \pi_k \alpha_0$. Here, $\pi_k$ is the expected proportion of entity $k$ in a sample. The technical details in the simulation procedure are as below. 1. **Estimate the parameters for simulation from a real data.** Estimate the proportions $\mathbf{\hat{\pi}} = (\hat{\pi}_{1}, \hat{\pi}_{2}, ..., \hat{\pi}_{m})$ and the overdispersion $\hat{\theta}$ from the real data \cite{Charlson2010} via maximum likelihood using package `r CRANpkg("dirmult")`. 2. **Randomly select two branches with different proportions to swap.** The selected branches are zoomed in (see Figure \@ref(fig:scenarioDiff)). We provide 3 simulation scenarios (S1, S2, and S3), which differ in how the swap is performed. + S1 (the left plot in Figure \@ref(fig:scenarioDiff)): The proportions of entities in sample $i$ are $\mathbf{p}^{G1}_{i}$ and $\mathbf{p}^{G2}_{i}$ for the group G1 and G2 (the phenotypic outcome), respectively. The values of $\mathbf{p}^{G1}_{i}$ and $\mathbf{p}^{G2}_{i}$ are randomly generated from the Dirichlet distribution as below. $$\mathbf{p}^{G1}_{i} \sim \text{Dirichlet}(\mathbf{\hat{\pi}}, \hat{\theta}).$$ $$\mathbf{p}^{G2}_{i} \sim \text{Dirichlet}(\mathbf{\hat{\pi}'}, \hat{\theta})$$ Compared with group G1, the proportions of entities on the blue and orange branch have been changed correspondingly by a common factor $r$ or $\frac{1}{r}$ in group G2. \begin{equation} \left\{\begin{matrix} {\hat{\pi}}'_{k} =\hat{\pi}_{k}; & k \notin \text{colored branches}\\ {\hat{\pi}}'_{k} = r\hat{\pi}_{k}; & k \in \text{ Blue branch } \\ {\hat{\pi}}'_{k} =\frac{1}{r}\hat{\pi}_{k}; & k \in \text{ Orange branch} \end{matrix}\right. (\#eq:s1) \end{equation} where $r = \frac{\sum_{k \in Orange} \hat{\pi}_{k}}{\sum_{k \in Blue} \hat{\pi}_{k}}$ is the fold change. + S2 (the middle plot in Figure \@ref(fig:scenarioDiff)): The proportions of entities in sample $i$ are $\mathbf{p}^{G1}_{i}$ and $\mathbf{p}^{G2}_{i}$ for the group G1 and G1, respectively. They are randomly generated from the Dirichlet distribution as below. $$\mathbf{p}^{G1}_{i} \sim \text{Dirichlet}(\bf{\hat{\pi}}, \hat{\theta}).$$ $$\mathbf{p}^{G2}_{i} \sim \text{Dirichlet}(\hat{\pi}', \hat{\theta})$$ Compared with group G1, the proportions of entities on the colored branches have changed by a common factor $r_k$ in group G2. $r_k$ varies among entities in the same branch, but they are either all above one or all below one. \begin{equation} (\#eq:s2) \left\{\begin{matrix} {\hat{\pi}}'_{k} =\hat{\pi}_{k}; & k \notin \text{colored branches}\\ {\hat{\pi}}'_{k} = r_k \hat{\pi}_{k}, \quad r_k = 1 + \frac{\sum_{t \in Orange} \hat{\pi}_{t} - \sum_{t \in Blue} \hat{\pi}_{t}}{\sum_{k \in Blue} \hat{\pi}_{k} u_k} u_k; & k \in \text{ Blue branch } \\ {\hat{\pi}}'_{k} = r_k \hat{\pi}_{k}, \quad r_k = 1 + \frac{\sum_{t \in Blue} \hat{\pi}_{t} - \sum_{t \in Orange} \hat{\pi}_{t}}{\sum_{k \in Orange} \hat{\pi}_{k} u_k} u_k & k \in \text{ Orange branch} \end{matrix}\right. \end{equation} where $u_k$ is randomly generated from a uniformly distributed function ($u_k \sim \text{U}(0, 1)$). The proof that the proportions of entities on the blue branch is swapped with that on the orange branch is in the [Appendix](#appendix). + S3 (the right plot in Figure \@ref(fig:scenarioDiff)): As shown in Figure \@ref(fig:scenarioDiff), all entities in one of the selected branches are colored as blue in S1, but only some entities from that branch are colored as blue in S3. This means not all entities on that branch are differentially abundant in S3. Technically, we could not say those entities with blue color in Figure \@ref(fig:scenarioDiff) is a branch, but for the convenience in explanation, they are refered as the blue branch. We allow users to adjust the proportion of leaf nodes colored as blue in that zoom branch via an argument `pct` in the `SimData` function. $$\mathbf{p}^{G1}_{i} \sim \text{Dirichlet}(\mathbf{\hat{\pi}}, \hat{\theta}).$$ $$\mathbf{p}^{G2}_{i} \sim \text{Dirichlet}(\mathbf{\hat{\pi}}', \hat{\theta})$$ Compared with group G1, the proportions of entities on the orange branch have changed by a common factor $c$ in group G2. The value $c$ could be chosen from 0 to 1. The entities on the blue branch have been changed by a common factor $r$ that depends on the value $c$. \begin{equation} \left\{\begin{matrix} {\hat{\pi}}'_{k} =\hat{\pi}_{k}; & k \notin \text{colored branches}\\ {\hat{\pi}}'_{k} = r \hat{\pi}_{k}; & k \in \text{ Blue branch } \\ {\hat{\pi}}'_{k} = c \hat{\pi}_{k}; & k \in \text{ Orange branch} \end{matrix}\right. (\#eq:s3) \end{equation} where $r = \frac{(1-c)\sum_{k \in Orange} \hat{\pi}_{k} + \sum_{k \in Blue} \hat{\pi}_{k}}{\sum_{k \in Blue} \hat{\pi}_{k}}$ and $0 < c < 1$. To summarize, there are 3 simulation scenarios available in the function `simData`. The differential abundance is achieved by swapping the proportions of two selected branches in group G2. This would make the entities that are not selected keep their abundance level across groups (phenotypic outcomes). ```r knitr::include_graphics("scene.png")
In Figure \@ref(fig:scene), the height of the bar represent the proportion. We see the tree structure as three parts, the two selected branches and the rest. The phenotypic outcome is represented by the group (G1 or G2). The proportions of the orange and blue branches are swapped between different groups in scenario S1 and S2. The height of the orange bar in the group G2 is equal to the blue bar in the group G1, and the height of the blue bar in group G2 is the same as that of the orange bar in group G1. In S1, the increase or decrease in the proportion is evenly distributed to the entities on the same branch by a common factor $r$ (on blue branch) or $\frac{1}{r}$ (on orange branch). However, in S2, the change in the proportion is not evenly distributed and the factor $r_k$ varies on the entities. In S3, the swap could be partial and depends on the value $c$. That's why the height of the orange bar in group G2 could be different to that of the blue bar in group G1.
The orange and blue branches are also referred as the signal branches in this vignette.
r Biocpkg("treeAGG")
provides a function simData
to simulate different
abundance patterns on the tree by swapping the relative abundance of two
branches. To simulate a count table, a hierarchical structure and a real count
table of entities are required. Here, we use a real throat microbiome data set
that includes a count table of 856 OTUs (operational taxonomic units) from 60
samples and a phylogenetic tree with 856 leaf nodes and 855 internal nodes
[@Charlson2010]. More details about the data could be seen via
?GUniFrac::throat.tree
and ?GUniFrac::throat.otu.tab
.
We could load the real data as below.
library(GUniFrac)
# the hierarchical struture data("throat.tree") # the count table data("throat.otu.tab")
If we already know for which two branches we want to swap the relative
abundance, we can feed simData
the real data via two arguments tree
and
data
, and the branches via from.A
and from.B
as below. How to manually
decide branches to swap could be found in section \@ref(selectBranch).
throat.otu.tab
has the samples in the rows and entities in the columns, so we
need to transpose it before input to simData
. To get reproducible result, the
random seed number is set.
# simulate count table set.seed(1122) datA <- simData(tree = throat.tree, data = as.matrix(t(throat.otu.tab)), from.A = 1678, from.B = 1613) datA class(datA)
The output datA is a leafSummarizedExperiment
object. More details about the
leafSummarizedExperiment
class could be found in the help page
?leafSummarizedExperiment
or in the vignette Introduction to
leafSummarizedExperiment and treeSummarizedExperiment.
The tree and count table could also be combined into a
leafSummarizedExperiment
object before using as the input for simData
.
lse <- leafSummarizedExperiment(tree = throat.tree, assays = list(t(throat.otu.tab))) lse set.seed(1122) datB <- simData(obj = lse, from.A = 1678, from.B = 1613)
# Two results are exactly the same all.equal(datA, datB)
The results, datA
and datB
, are exactly the same. Each includes a simulated
count table in assays
and the original input tree throat.tree as tree
in
metadata
. Additional information, such as fold changes of entities on the leaf
level (FC
), the information of branches that have differential abundance
(branch
) and the simulation scenario (scenario
), is also stored in
metadata
.
Estimating parameters from real data takes a while. In the vignette, we will do
simulation several times using the same real data. To save time, the function
parEstimate
is used to do the estimation once and save results in the
metadata
of the leafSummarizedExperiment
object for later use.
names(metadata(lse)) # the parameters are saved in the metadata as assays.par lse <- parEstimate(data = lse) names(metadata(lse))
The function viewSim
can be used to view the differential abundance pattern on
the tree structure. The simulation result is assigned to the obj
. The tree
could be shown in different shapes via layout
. If the tree is large, we could
shrink branches without differential abundance via zoomScale
to save space.
viewSim(obj = datA, layout = "rectangular", zoomScale = 1/100)
If more simulated tables are desired with the same simulation setting, the
argument n
could be used.
# simulate 2 count tables with the same setting set.seed(1122) datC <- simData(obj = lse, from.A = 1678, from.B = 1613, n = 2) length(assays(datC))
If we can't decide which branches to have the differential abundance pattern or
only decide one of them, we could let simData
randomly select the suitable
branches. For example, we can select two branches where one of them has twice
the abundance of the other (ratio = 2
).
# set seed to have reproducible results # branches are not given set.seed(1122) dat1 <- simData(obj = lse, ratio = 2)
The information about the selected branches is given in the branch
slot of the
metadata
. A
and B
are the labels of nodes that connect the selected
branches with the other part of the tree. The value in ratio
is the ratio of
the abundance of branch B
to that of branch A
. It doesn't have to be exactly
the same as the value specified in the argument ratio
. When there are not two
branches with exactly this ratio availabe in the tree, branches with value
closest to the value are found. The number of leaves in these two branches are
given in A_tips
and B_tips
. The relative abundance of the two branches are
in A_prop
and B_prop
in percentage.
metadata(dat1)$branch
Use viewSim
to visualize the signal pattern on the tree structure.
viewSim(obj = dat1, layout = "rectangular", zoomScale = 1/50)
If we want to select branches from those with the number of leaves in a given
interval, it can be achieved by changing the values of minTip.A
, maxTip.A
,
minTip.B
and maxTip.B
. Here, one branch is limited to have at least 20
leaves and at most 40 leaves, the other is to have at least 20 and at most 100.
set.seed(1122) dat2 <- simData(obj = lse, minTip.A = 20, maxTip.A = 40, minTip.B = 20, maxTip.B = 100, ratio = 2)
viewSim(obj = dat2, layout = "rectangular", zoomScale = 1/100)
Similarly, if branches with a certain abundance level are desired, it can be
done via minPr.A
and maxPr.A
. These two arguments restrict the count
proportion level of branch A. One could combine using these two arguments with
ratio
to restrict the branch B.
set.seed(1122) dat3 <- simData(obj = lse, minTip.A = 20, maxTip.A = 40, minTip.B = 20, maxTip.B = 100, maxPr.A = 0.01, ratio = 2) rb <- rbind(metadata(dat2)$branch, metadata(dat3)$branch) rownames(rb) <- c("dat2", "dat3") rb
The proportion of branch A is r rb[2, "A_prop"]
in dat3
. This fulfills the
requirement that the relative abundance level of branch A isn't above $0.01$
(maxPr.A
= 0.01).
viewSim(obj = dat3, layout = "rectangular", zoomScale = 1/100)
The ratio in dat3
is r metadata(dat3)$branch$ratio
, which is different from
the value given to the argument ratio = 2
. It's because there is not any pair
of branches which has ratio exactly equal to 2. The pair having the value
closest to 2 will be returned.
If we want to fix one branch, from.A
can be used.
# use the branch A selected in dat2. set.seed(1122) brA <- metadata(dat2)$branch$A dat4 <- simData(from.A = brA, obj = lse, minTip.A = 20, maxTip.A = 40, minTip.B = 20, maxTip.B = 100, ratio = 4) # show the selected branches rb <- rbind(metadata(dat2)$branch, metadata(dat4)$branch) rownames(rb) <- c("dat2", "dat4") rb
Branch A in dat2
and dat4
are the same.
We can simulate different differential abundance patterns by changing the
scenario
. The default scenario is "S1".
# use the branches from dat4 # change scenario to S2 set.seed(1122) dat5 <- simData(from.A = metadata(dat4)$branch$A, from.B = metadata(dat4)$branch$B, obj = lse, scenario = "S2")
If scenario = "S3"
is used, the argument adjB
could also be used. It is the
$c$ value in formula \@ref(eq:s3) and could be chosen between 0 and 1. If it's
set as NULL, the swap is similar to S1 in that the proportions of the selected
branches are swapped. Note that the blue branch here is not the branch specified
by from.A
. It should be only part of from.A
as colored with blue in the S3
of Figure \@ref(fig:scenarioDiff). Users could use pct
equal to a value
between 0 and 1 to decide the percentage of leaves in from.A
should be
selected as the blue branch.
# use the branches from dat4 # change scenario to S3 set.seed(1122) dat6 <- simData(from.A = metadata(dat4)$branch$A, from.B = metadata(dat4)$branch$B, obj = lse, pct = 0.6, adjB = NULL, scenario = "S3")
If users are interested, the values of $c$ and $r$ in formula \@ref(eq:s3) are
stored as FC
in the metadata
. The names of fc are the node labels and the
numeric values are the fold changes of corresponding nodes. Each leaf node has a
fold change value. Those have values above one ($r$) are in the blue branch and
below one in the orange branch ($c$) (see S3 in Figure \@ref(fig:scenarioDiff))
fc <- metadata(dat6)$FC
p1 <- viewSim(obj = dat4, layout = "circular", zoomScale = 1/200, legend.theme = list(legend.position = c(0.6, 0.4), legend.background = element_rect( fill="transparent")), legend.title = "S1") p2 <- viewSim(obj = dat5, layout = "circular", zoomScale = 1/200, legend.theme = list(legend.position = c(0.6, 0.4), legend.background = element_rect( fill="transparent")), legend.title = "S2") p3 <- viewSim(obj = dat6, layout = "circular", zoomScale = 1/200, legend.theme = list(legend.position = c(0.6, 0.4), legend.background = element_rect( fill="transparent")), legend.title = "S3")
library(cowplot) plot_grid(p1, p3) plot_grid(p2)
In scenario S3, points in a colored branch have the same color but different sizes. This means the fold changes of entities in that branch are different in values but same in the direction.
It's quite normal that a pair of branches that exactly fulfill the requirements
specified can't be found, especially when a small tree is provided. To
efficiently select suitable branches, we could summarize the information of
branches before the simulation using selNode
and manually select one or two
branches.
The function selNode
could be used to list all branches (all = TRUE
) that
have the number of leaf nodes between 10 (minTip = 10
) and 15 (maxTip = 15
)
and the relative abundance between 0.03 (minPr = 0.03
) and 0.05 (maxPr =
0.05
) as below.
(sel1 <- selNode(obj = lse, minTip = 11, maxTip = 12, minPr = 0.03, maxPr = 0.05, all = TRUE))
The result has listed all branches that meet the requirements. The branch nodes
of these branches are given in the column nodeNum
(the node number), nodeLab
(the node label), and nodeLab_alias
(the alias of the node label). Here, the
column nodeLab
has NA
value because the tree has no label for the nodes
selected. The relative abundance of these branches, which is a value between 0
and 1, is given in the column proportion
. The numbers of descendant leaves on
those branches are in column numTip
.
We could take one of the branches, for example, the branch on the first row, as
the fixed branch. As a pair of branches is required in the simulation, the other
branch could be selected randomly by simData
or manually as below. Let's say
the other branch should have roughly twice abundance as the currently selected
branch (minPr = 0.06, maxPr = 0.07
). Of course, we don't want a branch that
has overlap with the currently selected one. For this reason, skip
is used.
(sel2 <- selNode(obj = lse, minPr = 0.06, maxPr = 0.07, skip = sel1$nodeLab_alias[1], all = TRUE))
With the information above, we could take a branch that meets the requirement of the abundance level and has propor size (number of desendant leaves).
If the sibling branch of a selected branch is need, the function findSibling
could be used. The node that connects the sibling branch and the other part of
the tree is found and the its node number is returned. The node number is named
with the node label. If the tree has no label for the output node and use.alias
= FALSE
is specified, NA
is used as the name; otherwise, an alias of the node
label is used (use.alias = TRUE
). An alias of a node label is created by
prefixing "Node_" to the node number for internal nodes and "Leaf_" for leaf
nodes.
findSibling(tree = metadata(lse)$tree, input = sel1$nodeLab_alias[3], use.alias = TRUE)
If users are curious about the descendant leaves of the selected branch, the
function findOS
could be used. To see the descendants of a specific node, we
could put the node number or label in ancestor
.
findOS(tree = metadata(lse)$tree, ancestor = sel2$nodeNum[6])
Proof: $$\sum_{g \in Blue} \hat{\pi}'{g} = \sum{g \in Orange} \hat{\pi}{g}$$ $$\begin{aligned} \sum{g \in Blue} \hat{\pi}'{g} &= \sum{g \in Blue} r_g \hat{\pi}{g} \ &= \sum{g \in Blue} \hat{\pi}{g} r_g \ &= \sum{g \in Blue} \hat{\pi}{g}(1 + \frac{\sum{t \in Orange} \hat{\pi}{t} - \sum{t \in Blue} \hat{\pi}{t}}{\sum{t \in Blue} \hat{\pi}{t} u_t} u_g) \ &= \sum{g \in Blue} \hat{\pi}{g}(1 + \frac{\sum{t \in Orange} \hat{\pi}{t} - \sum{t \in Blue} \hat{\pi}{t}}{\sum{t \in Blue} \hat{\pi}{t} u_t} u_g) \ &= \sum{g \in Blue} \hat{\pi}{g} + \sum{g \in Blue} \hat{\pi}{g} \frac{\sum{t \in Orange} \hat{\pi}{t} - \sum{t \in Blue} \hat{\pi}{t}}{\sum{t \in Blue} \hat{\pi}{t} u_t} u_g \ &= \sum{g \in Blue} \hat{\pi}{g} + (\sum{t \in Orange} \hat{\pi}{t} - \sum{t \in Blue} \hat{\pi}{t}) \frac{\sum{g \in Blue} \hat{\pi}{g} u_g}{\sum{t \in Blue} \hat{\pi}{t} u_t} \ & = \sum{t \in Orange} \hat{\pi}{t} \ & = \sum{g \in Orange} \hat{\pi}_{g} \end{aligned}$$
Proof: $$\sum_{g \in Orange} \hat{\pi}'{g} = \sum{g \in Blue} \hat{\pi}_{g}$$ Similar to the proof above.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.