This document demonstrates how to use the TreeMHN
package. The package is created based on the article "Joint inference of repeated evolutionary trajectories and patterns of clonal exclusivity or co-occurrence from tumor mutation trees".
library(TreeMHN) library(parallel) library(ggplot2) library(ggpubr)
The function generate_trees
can generate a random Mutual Hazard Network $\Theta$ and a set of mutation trees from $\Theta$ according to the tree generating process introduced in the paper.
set.seed(6666) n <- 10 # number of events N <- 500 # number of samples lambda_s <- 1 # sampling event rate gamma <- 0.5 # penalty parameter sparsity <- 0.5 # sparsity of the MHN exclusive_ratio <- 0.5 # proportion of inhibiting edges # This function will generate a random MHN along with a collection of mutation trees tree_obj <- generate_trees(n = n, N = N, lambda_s = lambda_s, sparsity = sparsity, exclusive_ratio = exclusive_ratio) true_Theta <- tree_obj$Theta # true MHN trees <- tree_obj$trees # extract trees
We can plot one of the trees using the plot_tree_list
function:
tree <- trees[[66]] plot_tree_list(tree, tree_obj$mutations)
The function learn_MHN
takes a TreeMHN
object and learns an MHN $\hat{\Theta}$.
pred_Theta <- learn_MHN(tree_obj, gamma = gamma, lambda_s = lambda_s, verbose = TRUE)
# Function to subsample half of the trees and learn one MHN subsample_size <- floor(N/2) subsample_once <- function(subsample_size, tree_obj, gamma) { subsample_trees <- sample(tree_obj$trees, subsample_size) tree_obj$trees <- subsample_trees tree_obj$N <- subsample_size tree_obj$N_patients <- subsample_size tree_obj$weights <- rep(1, subsample_size) pred_Theta <- learn_MHN(tree_obj, gamma) return(pred_Theta) } # Function to get a vector of non-selected (masked) elements get_mask <- function(n, SS_res, thr = 0.95) { to_mask <- sapply(SS_res, function (x) as.vector(x)) to_mask[abs(to_mask) > 1e-2] <- 1 to_mask[to_mask != 1] <- 0 to_mask <- which(matrix(apply(to_mask,1,mean) > thr,nrow = n) == 0) return(to_mask) }
It is recommended to run the stability selection procedure using the parallel
package (runtime up to 10 min depending on the number of cores available).
SS_res <- mclapply(c(1:1000), function(i) subsample_once(subsample_size, tree_obj, gamma), mc.cores = detectCores()) TreeMHN_to_mask <- get_mask(n, SS_res, 0.99) pred_Theta_w_SS <- learn_MHN(tree_obj, gamma = gamma, to_mask = TreeMHN_to_mask)
We first plot the true MHN and the two estimated MHNs by ordering the entries based on the true baseline rates. At a regularization level $\gamma = 0.5$, most entries at the top left corner are recovered. Without stability selection, there are many false non-zero entries due to overfitting, including some very small values at the top left corner. With stability selection, we see that those entries are removed, along with some true positive entries at the lower left corner.
idx <- order(diag(true_Theta),decreasing = TRUE) top_idx <- idx[c(1:(n/2))]
g1 <- annotate_figure(plot_Theta(true_Theta, to_show = idx), top = "True MHN") g2 <- annotate_figure(plot_Theta(pred_Theta, to_show = idx), top = "TreeMHN") g3 <- annotate_figure(plot_Theta(pred_Theta_w_SS, to_show = idx), top = "TreeMHN with stability selection")
ggarrange(g1, g2, g3, nrow = 1, ncol = 3)
We can compute the precision and recall (= true positive rate) based on the off-diagonal differences using function compare_Theta
. (See the Supplementary Material for more details.)
$\Theta$ (left) $\hat{\Theta}$ (top) | $i \quad j$ | $i \rightarrow j$ | $i \dashv j$
------------------------------------- | ----------- | ----------------- | -------------
$i \quad j$ | TN | FP | FP
$i \rightarrow j$ | FN | TP | FP
$i \dashv j$ | FN | FP | TP
compare_Theta(true_Theta, pred_Theta)
compare_Theta(true_Theta, pred_Theta_w_SS)
If we focus on the first half of the events with higher baseline rates, we can see an increase in recall/TPR.
compare_Theta(true_Theta[top_idx,top_idx], pred_Theta[top_idx,top_idx])
compare_Theta(true_Theta[top_idx,top_idx], pred_Theta_w_SS[top_idx,top_idx])
Here we use the tree dataset from Morita et al. (2020).
# note that we summarize the mutations at the gene level load("./Data/AML_morita/AML_tree_obj.RData") plot_tree_df(AML$tree_df[AML$tree_df$Tree_ID == match("AML-38_AML-38-001", AML$tree_labels),], AML$mutations, "AML-38_AML-38-001")
To use another dataset, please make sure it is in dataframe format with five columns:
Patient_ID
: IDs of patients, unique for each patient;
Tree_ID
: IDs of mutation trees, unique within each patient;
Node_ID
: IDs of each node in the tree, including the root node (with ID "1"), unique for each node;
Mutation_ID
: IDs of each mutational event. The root node has a mutation ID of "0", and other mutation IDs can be duplicated in the tree to allow for parallel mutations;
Parent_ID
: IDs of the parent node ID. The root node has itself as parent (ID "1").
For example,
head(AML$tree_df)
To convert a dataframe to an TreeMHN
object, use the input_tree_df
function. For example,
# not run # input_tree_df(n = AML$n, tree_df = AML$tree_df, mutations = AML$mutations, tree_labels = AML$tree_labels)
To ensure enough precision, we run stability selection with $\gamma = 0.1$ and a threshold of $95\%$ and obtain a vector of non-selected elements over $1000$ subsamples. Again, we recommend to run the code using the parallel
package on a cluster (runtime up to 20 min depending on the number of cores available).
RNGkind("L'Ecuyer-CMRG") gamma <- 0.1 subsample_size <- floor(AML$N / 2) SS_res <- mclapply(c(1:1000), function(i) subsample_once(subsample_size, AML, gamma), mc.cores = detectCores(), mc.set.seed = TRUE) to_mask <- get_mask(AML$n, SS_res, 0.95)
Then we refit the model by masking the non-selected elements.
AML_Theta <- learn_MHN(AML, gamma = gamma, to_mask = to_mask)
save(AML_Theta, file = "./Data/AML_morita/AML_Theta.RData")
Now we can plot the learned MHN. The diagonal entries represent the baseline rates of mutations to occur, independent of other events. The off-diagonal entries represent the interactions between mutations.
plot_Theta(AML_Theta, AML$mutations)
If we focus on the entries with non-zero off-diagonal entries, we get:
plot_Theta(AML_Theta, AML$mutations, full = FALSE)
The plot_observed_pathways
function plots the observed trajectories sorted according to their relative frequencies. Given the estimated MHN, we can plot the most probable evolutionary trajectories using the plot_pathways_w_sampling
function.
mutation_colors <- colors()[c(7, 11, 20, 143, 419, 417, 53, 62, 43, 76, 623, 80, 81, 542, 86, 93, 96, 524, 101, 102, 364, 367, 373, 383, 387, 399, 404, 405, 411, 481, 493)] names(mutation_colors) <- AML$mutations g1 <- plot_observed_pathways(AML, AML_Theta, mutation_colors = mutation_colors) g2 <- plot_pathways_w_sampling(AML_Theta, AML$mutations, top_M = 40, mutation_colors = mutation_colors) ggarrange(g1, g2)
Given a particular tree and the estimated MHN, we can also find the next most probable mutational events using the plot_next_mutations
function.
tree_df <- AML$tree_df[AML$tree_df$Tree_ID == match("AML-09_AML-09-001", AML$tree_labels),] plot_next_mutations(AML$n, tree_df, AML_Theta, AML$mutations, "AML-09_AML-09-001", 8)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.