In cbg-ethz/TreeMHN: Joint inference of repeated evolutionary trajectories and patterns of clonal exclusivity from single-cell mutation trees

This document demonstrates how to use the TreeMHN package. The package is created based on the article "Joint inference of repeated evolutionary trajectories and patterns of clonal exclusivity or co-occurrence from tumor mutation trees".

1 Load required packages

library(TreeMHN)
library(parallel)
library(ggplot2)
library(ggpubr)

2 Simulated data

2.1 Generate trees

The function generate_trees can generate a random Mutual Hazard Network $\Theta$ and a set of mutation trees from $\Theta$ according to the tree generating process introduced in the paper.

set.seed(6666)
n <- 10 # number of events
N <- 500 # number of samples
lambda_s <- 1 # sampling event rate
gamma <- 0.5 # penalty parameter
sparsity <- 0.5 # sparsity of the MHN
exclusive_ratio <- 0.5 # proportion of inhibiting edges

# This function will generate a random MHN along with a collection of mutation trees
tree_obj <- generate_trees(n = n, N = N, lambda_s = lambda_s, sparsity = sparsity, 
                           exclusive_ratio = exclusive_ratio)
true_Theta <- tree_obj$Theta # true MHN
trees <- tree_obj$trees # extract trees

We can plot one of the trees using the plot_tree_list function:

tree <- trees[[66]]
plot_tree_list(tree, tree_obj$mutations)

2.2 Learn MHN from the generated trees

The function learn_MHN takes a TreeMHN object and learns an MHN $\hat{\Theta}$.

Without stability selection:

pred_Theta <- learn_MHN(tree_obj, gamma = gamma, lambda_s = lambda_s, verbose = TRUE)

With stability selection (Meinshausen and Bühlmann (2010)):

# Function to subsample half of the trees and learn one MHN
subsample_size <- floor(N/2)
subsample_once <- function(subsample_size, tree_obj, gamma) {
  subsample_trees <- sample(tree_obj$trees, subsample_size)
  tree_obj$trees <- subsample_trees
  tree_obj$N <- subsample_size
  tree_obj$N_patients <- subsample_size
  tree_obj$weights <- rep(1, subsample_size)
  pred_Theta <- learn_MHN(tree_obj, gamma)
  return(pred_Theta)
}

# Function to get a vector of non-selected (masked) elements 
get_mask <- function(n, SS_res, thr = 0.95) {
  to_mask <- sapply(SS_res, function (x) as.vector(x))
  to_mask[abs(to_mask) > 1e-2] <- 1
  to_mask[to_mask != 1] <- 0
  to_mask <- which(matrix(apply(to_mask,1,mean) > thr,nrow = n) == 0)
  return(to_mask)
}

It is recommended to run the stability selection procedure using the parallel package (runtime up to 10 min depending on the number of cores available).

SS_res <- mclapply(c(1:1000), 
                   function(i) subsample_once(subsample_size, tree_obj, gamma), 
                   mc.cores = detectCores())
TreeMHN_to_mask <- get_mask(n, SS_res, 0.99)
pred_Theta_w_SS <- learn_MHN(tree_obj, gamma = gamma, to_mask = TreeMHN_to_mask)

2.3 Performance assessment

We first plot the true MHN and the two estimated MHNs by ordering the entries based on the true baseline rates. At a regularization level $\gamma = 0.5$, most entries at the top left corner are recovered. Without stability selection, there are many false non-zero entries due to overfitting, including some very small values at the top left corner. With stability selection, we see that those entries are removed, along with some true positive entries at the lower left corner.

idx <- order(diag(true_Theta),decreasing = TRUE)
top_idx <- idx[c(1:(n/2))]

g1 <- annotate_figure(plot_Theta(true_Theta, to_show = idx), top = "True MHN")
g2 <- annotate_figure(plot_Theta(pred_Theta, to_show = idx), top = "TreeMHN")
g3 <- annotate_figure(plot_Theta(pred_Theta_w_SS, to_show = idx), top = "TreeMHN with stability selection")

ggarrange(g1, g2, g3, nrow = 1, ncol = 3)

We can compute the precision and recall (= true positive rate) based on the off-diagonal differences using function compare_Theta. (See the Supplementary Material for more details.)

$\Theta$ (left) $\hat{\Theta}$ (top) | $i \quad j$ | $i \rightarrow j$ | $i \dashv j$ ------------------------------------- | ----------- | ----------------- | ------------- $i \quad j$ | TN | FP | FP
$i \rightarrow j$ | FN | TP | FP
$i \dashv j$ | FN | FP | TP

Without stability selection:

compare_Theta(true_Theta, pred_Theta)

With stability selection:

compare_Theta(true_Theta, pred_Theta_w_SS)

If we focus on the first half of the events with higher baseline rates, we can see an increase in recall/TPR.

Without stability selection:

compare_Theta(true_Theta[top_idx,top_idx], pred_Theta[top_idx,top_idx])

With stability selection:

compare_Theta(true_Theta[top_idx,top_idx], pred_Theta_w_SS[top_idx,top_idx])

3 Real data

3.1 Input dataset

Here we use the tree dataset from Morita et al. (2020).

# note that we summarize the mutations at the gene level
load("./Data/AML_morita/AML_tree_obj.RData")
plot_tree_df(AML$tree_df[AML$tree_df$Tree_ID == match("AML-38_AML-38-001", AML$tree_labels),],
             AML$mutations, "AML-38_AML-38-001")

To use another dataset, please make sure it is in dataframe format with five columns:

Patient_ID: IDs of patients, unique for each patient;
Tree_ID: IDs of mutation trees, unique within each patient;
Node_ID: IDs of each node in the tree, including the root node (with ID "1"), unique for each node;
Mutation_ID: IDs of each mutational event. The root node has a mutation ID of "0", and other mutation IDs can be duplicated in the tree to allow for parallel mutations;
Parent_ID: IDs of the parent node ID. The root node has itself as parent (ID "1").

For example,

head(AML$tree_df)

To convert a dataframe to an TreeMHN object, use the input_tree_df function. For example,

# not run
# input_tree_df(n = AML$n, tree_df = AML$tree_df, mutations = AML$mutations, tree_labels = AML$tree_labels)

3.2 Learn the MHN

To ensure enough precision, we run stability selection with $\gamma = 0.1$ and a threshold of $95\%$ and obtain a vector of non-selected elements over $1000$ subsamples. Again, we recommend to run the code using the parallel package on a cluster (runtime up to 20 min depending on the number of cores available).

RNGkind("L'Ecuyer-CMRG")
gamma <- 0.1
subsample_size <- floor(AML$N / 2)
SS_res <- mclapply(c(1:1000), 
                   function(i) subsample_once(subsample_size, AML, gamma), 
                   mc.cores = detectCores(), 
                   mc.set.seed = TRUE)
to_mask <- get_mask(AML$n, SS_res, 0.95)

Then we refit the model by masking the non-selected elements.

AML_Theta <- learn_MHN(AML, gamma = gamma, to_mask = to_mask)

save(AML_Theta, file = "./Data/AML_morita/AML_Theta.RData")

3.3 Plot the network

Now we can plot the learned MHN. The diagonal entries represent the baseline rates of mutations to occur, independent of other events. The off-diagonal entries represent the interactions between mutations.

plot_Theta(AML_Theta, AML$mutations)

If we focus on the entries with non-zero off-diagonal entries, we get:

plot_Theta(AML_Theta, AML$mutations, full = FALSE)

The plot_observed_pathways function plots the observed trajectories sorted according to their relative frequencies. Given the estimated MHN, we can plot the most probable evolutionary trajectories using the plot_pathways_w_sampling function.

mutation_colors <- colors()[c(7, 11, 20, 143, 419,
                              417, 53, 62, 43, 76,
                              623, 80, 81, 542, 86,
                              93, 96, 524, 101, 102,
                              364, 367, 373, 383, 387,
                              399, 404, 405, 411, 481, 493)]
names(mutation_colors) <- AML$mutations

g1 <- plot_observed_pathways(AML, AML_Theta, mutation_colors = mutation_colors)
g2 <- plot_pathways_w_sampling(AML_Theta, AML$mutations, top_M = 40, mutation_colors = mutation_colors)
ggarrange(g1, g2)

Given a particular tree and the estimated MHN, we can also find the next most probable mutational events using the plot_next_mutations function.

tree_df <- AML$tree_df[AML$tree_df$Tree_ID == match("AML-09_AML-09-001", AML$tree_labels),]
plot_next_mutations(AML$n, 
                    tree_df,
                    AML_Theta,
                    AML$mutations,
                    "AML-09_AML-09-001",
                    8)