knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
The arrangement of hypotheses in a hierarchical structure appears in many
research fields and often indicates different resolutions at which data can be
viewed and results can be interpreted. Examples include studies of microbial
abundance where the OTUs (operational taxonomic units, often assumed to
represent species) can be arranged as leaves in a phylogenetic tree, and
single-cell studies where the individual cells can be organized hierarchically,
e.g. via clustering. In both these examples, a question arises on which
resolution level the signal in the data (e.g. differential abundance of features
between groups) is best interpreted, or put another way, on what resolution the
main 'driver' of the signal can be found. For example, in the microbial
abundance data, a whole family of species may react similarly to a particular
stimulus. In such situations, it would be more informative to summarize the
differential abundance analysis at this level of resolution, rather than
providing a long list of individual differentially abundant OTUs. An additional
reason for aggregating data to higher levels of the phylogenetic tree is that
the abundances of individual species or OTUs may be low, limiting the power of
the statistical test to pick up any significant differences, whereas by
aggregating the signal from multiple related OTUs, the detection power can be
increased. For the single-cell data example, similarly, if a treatment truly
acts on a high level, affecting the abundance of, e.g., all B-cells, it is
helpful if this is picked up by the analysis pipeline, rather than returning a
large number of significantly differentially abundant subclusters of the main
B-cell cluster. The r Biocpkg("treeAGG")
package is designed to analyze data
that can be organized in a hierarchical structure, and to find the most
informative resolution at which to interpret the signal of interest.
r Biocpkg("treeAGG")
knitr::include_graphics("Flowchart.png")
In the r Biocpkg("treeAGG")
package, we provide an easy-to-use pipeline
(Figure \@ref(fig:flowchart)) to find the most informative level of the
hierarchical structure for interpretation. The starting point for the analysis
is a hierarchical structure (e.g., a phylogenetic tree), a data matrix with
values for all leaves in the hierarchical structure, and annotation tables for
rows and columns of the matrix. From here, users can follow the standard five
steps outlined in Figure \@ref(fig:flowchart) to find the main 'driver' of the
signal. The details of the five steps are listed as below. The example of each
step is in Section \@ref(example).
Step 1: The data on the level of the leaf nodes is collected and stored in a
leafSummarizedExperiment
container.
Step 2: The data on the level of the internal nodes is generated and stored in
a treeSummarizedExperiment
container.
Step 3: The phenotypic outcome association test is performed at each node of
the tree, and a p-value is derived for each node. If the differential abundance
analysis is desired, runEdgeR
, a wrapper using functions from the
r Biocpkg("treeAGG")
package, could be used. Users are free to choose any
suitable software to do their customized analysis. What to be expected is that a
p-value is derived for each node.
Step 4: The min-P algorithm is applied to do tree aggregation.
Step 5: The result is printed out.
To do customized analysis, the extended data matrix can be extracted after step 2, and the results of which can be integrated back into the pipeline to do step 4 and 5. Note that the table extracted has more rows than the original table because it includes data for the entities that represent the internal nodes of the tree. In each step, we have listed the available functions to use in blue texts.
We suggest users to come back to this workflow (Figure \@ref(fig:flowchart)) after they read Section \@ref(example).
Consider the case that we have a table that contains measurments of entities collected from samples with different phenotypic outcomes, and a tree that represents the hierarchical structure of the entities. Entities are in the rows of the table, and samples are in the columns. Each entity could be mapped to a node of the tree.
To arrange hypotheses in the hierarchical structure, we need that each node of the tree could be mapped to a row of the table. However, in most case, only the data of entities corresponding to the leaf nodes of the tree could be observed. For the internal nodes, the data has to be generated from that of leaf nodes. Users could decide how to reasonably create the data for the internal nodes. For the abundance data, we suggest to create the value for an internal node by summing the count of its descendant leaf nodes.
With the data ready, we test at each node of the tree the null hypothesis (H0: There is no association between the differential abundance and the phenotypic outcome.) Multiple testing correction methods, such as the Benjamin-Hochberg procedure, could be applied then to decide whether the null hypothesis at a node should be rejected.
It's likely that an internal node found to be significantly associated with the phenotypic outcome is driven by some of its descendant nodes. In other words, only some of its descendant nodes are phenotypic outcome associated, and this signal is not diluted enough by other descendant nodes that are not phenotypic outcome associated. In the tree aggregation step, we try to pinpoint the nodes that drive the association.
The packages below are required.
suppressPackageStartupMessages({ library(treeAGG) library(edgeR) library(S4Vectors) library(ggtree)})
In this section, a toy data is used to show how to perform the standard five steps outlined in Figure \@ref(fig:flowchart).
At the begining, we have a tree structure (tinyTree), a table (toyData) and the annotation data for the rows and columns of the table (rowD and colD). Entities are in the rows of the table toyData and samples are in the columns. Each entity could be mapped to a leaf node of the tree. The first five samples belong to the group G1, and the other five to G2.
The table is generated in a way that only the first five entities have differential proportion among groups. The back ground of the branches that the five entities on is colored as cyan. Our goal is find the optimal level on the tree, indicated by the violet squares, to interpret this difference.
# tree data("tinyTree")
ggtree(tinyTree) + geom_text2(aes(subset = isTip, label = label), hjust = -0.2) + geom_hilight(node = 18, fill = "cyan", alpha = 0.1) + geom_hilight(node = 13, fill = "cyan", alpha = 0.1) + geom_point2(aes(subset = (node %in% c(13, 18))), size = 4, color = "darkviolet", shape = 23, stroke = 3)
# table # p1 and p2 differ in the first five values p1 <- rep(0.1, 10) p2 <- c(rep(0.05, 3), rep(0.175, 2), rep(0.1, 5) ) set.seed(1) # randomly generate the table from multinomial distribution toyData <- cbind(rmultinom(n = 5, size = 1000, prob = p1), rmultinom(n = 5, size = 1000, prob = p2)) # annotation data for rows # The true differential information between groups is in Diff rowD <- DataFrame(nodeLab = tinyTree$tip.label, Diff = rep(c(TRUE, FALSE), each = 5)) # annotation data for columns colD <- DataFrame(trt = sample(letters[1:3], size = 10, replace = TRUE), group = rep(c("G1", "G2"), each = 5))
All data above could be stored in a leafSummarizedExperiment
container using
the function leafSummarizedExperiment
. More details about the
leafSummarizedExperiment
class could be found in the help page
?leafSummarizedExperiment
or in the vignette Introduction to
leafSummarizedExperiment and treeSummarizedExperiment.
lse <- leafSummarizedExperiment(assays = list(toyData), rowData = rowD, colData = colD, tree = tinyTree)
The data in Section \@ref(step1) is on the level of the leaf nodes. To
generate data for the internal nodes, the function nodeValue
can be used.
Users could decide how to generate the values for internal nodes via fun
.
Here, we consider the value at an internal node to be the sum of the values at
its descendants (fun = sum
).
tse <- nodeValue(data = lse, fun = sum)
The output tse is a treeSummarizedExperiment
object. More details about the
treeSummarizedExperiment
class could be found in the help page
?treeSummarizedExperiment
or in the vignette Introduction to
leafSummarizedExperiment and treeSummarizedExperiment.
With the data ready, the statistical analysis could be performed using suitable
softwares. Here, runEdgeR
is shown as an example to do differential abundance
analysis. It's a wrapper of functions from r Biocpkg("edgeR")
. The detailed
analysis performed is shown step by step in Section \@ref(pline2).
new_tse <- runEdgeR(obj = tse, use.assays = 1, design = NULL, contrast = NULL, normalize = TRUE, method = "TMM", adjust.method = "BH")
To do analysis, we assign tse to the obj
argument. If there multiple
matrix-like elements in the assays
of tse, we could use use.assays
to
indicate which elements are used for the analysis. Here, the first one is used
(use.assays = 1
). The design matrix and contrasts can be customized
correspondingly via design
and contrast
, and they will be saved in the
metadata
of new_tse for later check. If the design matrix is not given
(design = NULL
), colData
is used to generate a design matrix. If the
contrast is not provided (contrast = NULL
), the last coefficient is tested in
the glmLRT
step (see ?edgeR::glmLRT
and Section \@ref(pline2)).
To print out the result, topNodes
can be used. The output is a list
. More
details about its structure could be found in Section \@ref(mtab).
(tN1 <- topNodes(data = new_tse))
The tree aggregation can be done in one step using treeAGG
.
outP <- treeAGG(data = new_tse, agg.by = "PValue", sigf.by = "FDR", sigf.limit = 0.05, message = TRUE)
Based on the analysis result (data = new_tse
), we specify the name of the
column in tN1 for aggregation via agg.by
, and the name of the column storing
the adjusted p-value via sigf.by
. The threshold for the adjusted p-value can
be decided via sigf.limit
. To see the running process, message = TRUE
is
used.
We could use topNodes
again to print out the result after aggregation. If some
columns in rowData
and linkData
are of interest, they could be printed out
too via col.rowData
and col.linkData
, respectively.
(tN2 <- topNodes(data = outP, col.rowData = "Diff", col.linkData = "nodeNum"))
Compared to tN1, tN2 has three more columns, aggKeep, Diff and
nodeNum. The column aggKeep is created in the step of the tree aggregation,
the other two are from rowData
and linkData
as we specified in the
arguments.
The next step is to check whether the optimal level for the interpretation is
found. The true differential information is stored in the column Diff
of
rowData
.
# Extract data having nodes that are truely differentially abundant outP_sel <- outP[rowData(outP)$Diff %in% TRUE, ] # the true optimal level of the signal # use signalNode to remove the redundant nodes truth <- signalNode(tree = treeData(outP_sel), node = linkData(outP_sel)$nodeNum)
The estimated optimal level to interpret the differential abundance pattern from
r Biocpkg("treeAGG")
is as below.
# the estimated signal level res <- tN2$result_assay1$contrastNULL found <- res$nodeNum[res$aggKeep]
The node number (nodeNum
) instead of the node label is extracted because
internal nodes might not have labels in the tree provided.
p <- treePlot(tree = treeData(outP) , branch = truth, point = found, layout = "rectangular") # using the function geom_text2 from the ggtree packages to add labels p + geom_text2(aes(label = label), hjust = -0.3)
We use treePlot
(?treePlot
) to visualize the results. The branches with blue
edges are truly differentially abundant. The results obtained from the minP
aggregation algorithm implemented in r Biocpkg("treeAGG")
are shown as orange
points. Compared to Figure \@ref(fig:sigTree), the orange points are exactly in
the same location as the violet squares. That indicates the true signal is found
correctly. More details about how to use r Biocpkg("ggtree")
could be seen
here.
As described in Figure \@ref(fig:flowchart), the table stored in assays
could
be extracted to do customized analysis. Note: To make sure we know which row
corresponds to which node of the tree, we need to use use.nodeLab = TRUE
.
# extract the abundance table dat <- assays(tse, use.nodeLab = TRUE)[[1]]
Users are free to choose any suitable software to perform analysis for their
data. Here, we follow the routine steps of using the r Biocpkg("edgeR")
package to analyze the toy data.
# calculate total count for each sample # The total count is the sum of cell counts of clusters on the leaf level of the # tree. tip_tse <- tse[linkData(tse)$isLeaf, ] tipDat <- assays(tip_tse, use.nodeLab = TRUE)[[1]] libSize <- apply(tipDat, 2, sum) # create DGEList y <- DGEList(counts = dat, lib.size = libSize, remove.zeros = FALSE) # calculate normalisation factors y <- calcNormFactors(object = y, method = "TMM") # construct design matrix sample_inf <- colData(tse) design <- model.matrix(~ trt + group, data = sample_inf) # estimate dispersion y <- estimateDisp(y, design = design) # fit the negative binomial GLMs fit <- glmFit(y, design = design, prior.count = 0.125) # run likelihood ratio tests # contrast is not specified here, so the last coefficient is tested. lrt <- glmLRT(fit, contrast = NULL) # Use Benjamin-Hochberg method to do multiple testing correction # n is set to Inf below, because we want to have the results of all entities. out <- topTags(lrt, n = Inf, adjust.method = "BH")$table head(out)
The output out has five columns, one of which is the nominal p-value (PValue) and one is the adjusted p-value (FDR). These columns are required by the tree aggregation step.
If other softwares are used to do analysis, the output is expected to has at
least two columns that contain the nominal and adjusted p-values, and each row
representing a node of the tree. The column names are not necessary to be
PValue
and FDR
, as they could be specified in the function treeAGG
(see
Section \@ref(step4)).
To prepare for the tree aggregation, we now put the analysis result in the
rowData
of tse via updateTSE
.
outList <- list(assay1 = list(out)) new_tse1 <- updateTSE(result = outList, tse = tse, use.assays = 1, design = design, contrast = NULL, fit = list(fit))
The result out is changed to a list object outList before it is assigned to
the argument result
. To update the treeSummarizedExperiment
object (tse =
tse
), the result (result = outList
) and the information, such as the design
matrix (design
), the contrasts (contrast
) and the number of matrix-like
elements in assays
(use.assays
) that were used, should be provided. We could
also optionally store the result fit created by the glmFit
function for
later use, for example, to quickly get new result when specifying a new
contrast.
The analysis performed by runEdgeR
is exactly the same as the code in this
section.
all.equal(new_tse, new_tse1)
Now, we are back to the standard part of the pipeline and could follow Section \@ref(step4) and Section \@ref(step5) to get the final results.
It's possible to perform analysis and aggregation simultaneously on multiple
elements of assays
with different contrasts.
The data tseM with two elements in the assays
is created.
tseM <- treeSummarizedExperiment(assays = list(dat, 2*dat), tree = treeData(tse), rowData = rowData(tse), colData = colData(tse))
To use both elements, we specify use.assays = c(1, 2)
. Two different contrasts
are applied (contrast
) to the analysis of each element.
# data analysis new_tseM <- runEdgeR(obj = tseM, use.assays = c(1, 2), design = NULL, contrast = list(contrast1 = c(0, 1, -1, 0), contrast2 = c(0, 0, 0, 1)), normalize = TRUE, method = "TMM", adjust.method = "BH")
If the customized analysis was performed, the results from multiple elements could be written back in one step.
To simplify the document, we directly use the output from Section \@ref(cDA) to show how to do it.
# assume two contrasts are used in the analysis cList <- list(contrast1 = c(0, 1, -1, 0), contrast2 = c(0, 0, 0, 1)) # the results from the first element of assays # name the results as the list of the contrasts (cList) res1 <- list(contrast1 = out, contrast2 = 2*out) # the results from the second element of assays # name the results as the list of the contrasts (cList) res2 <- list(contrast1 = 0.5*out, contrast2 = 3*out) # combine the results into a list # name the results with 'assay' followed by a number # e.g., if the first and the third table in the assays were used for analysis # then, we name the result with assay1 assay3 res <- list(assay1 = res1, assay2 = res2) # keep both results and data in the container test_tseM <- updateTSE(result = res, tse = tseM, use.assays = c(1, 2), design = NULL, contrast = cList)
Now, the output could be obtained using topNodes
as mentioned before. The
output is a list
.
rD4 <- topNodes(test_tseM) class(rD4)
Each element of the list is the analysis result of a table in assays
.
# Here, we have two elements: one is the result of the first table, # and the other is the result of the second table names(rD4)
Each sub-element is a list
too. It stores the analysis result of a table in
assays
under a specific contrast.
class(rD4$result_assay1) names(rD4$result_assay1)
Here, the table below is exactly the data out that was put in.
# There are five columns head(rD4$result_assay1$contrast1)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.