TreeMHN: Joint inference of repeated evolutionary trajectories and patterns of clonal exclusivity from single-cell mutation trees

This document demonstrates how to use the TreeMHN package. The package is created based on the article “Joint inference of repeated evolutionary trajectories and patterns of clonal exclusivity or co-occurrence from tumor mutation trees”.

1 Load required packages

library(TreeMHN)
library(parallel)
library(ggplot2)
library(ggpubr)

2 Simulated data

The function generate_trees can generate a random Mutual Hazard Network (\Theta) and a set of mutation trees from (\Theta) according to the tree generating process introduced in the paper.

set.seed(6666)
n <- 10 # number of events
N <- 500 # number of samples
lambda_s <- 1 # sampling event rate
gamma <- 0.5 # penalty parameter
sparsity <- 0.5 # sparsity of the MHN
exclusive_ratio <- 0.5 # proportion of inhibiting edges

# This function will generate a random MHN along with a collection of mutation trees
tree_obj <- generate_trees(n = n, N = N, lambda_s = lambda_s, sparsity = sparsity, 
                           exclusive_ratio = exclusive_ratio)
true_Theta <- tree_obj$Theta # true MHN
trees <- tree_obj$trees # extract trees

We can plot one of the trees using the plot_tree_list function:

tree <- trees[[66]]
plot_tree_list(tree, tree_obj$mutations)

The function learn_MHN takes a TreeMHN object and learns an MHN (\hat{\Theta}).

Without stability selection:

pred_Theta <- learn_MHN(tree_obj, gamma = gamma, lambda_s = lambda_s, verbose = TRUE)

## Initializing Theta...
## Checking whether MCEM is needed...
## Running MLE...
## iter   10 value 3534.182513
## iter   20 value 3520.105848
## iter   30 value 3515.849267
## iter   40 value 3514.837437
## iter   50 value 3514.370185
## final  value 3514.281048 
## converged

With stability selection (Meinshausen and Bühlmann (2010)):

# Function to subsample half of the trees and learn one MHN
subsample_size <- floor(N/2)
subsample_once <- function(subsample_size, tree_obj, gamma) {
  subsample_trees <- sample(tree_obj$trees, subsample_size)
  tree_obj$trees <- subsample_trees
  tree_obj$N <- subsample_size
  tree_obj$N_patients <- subsample_size
  tree_obj$weights <- rep(1, subsample_size)
  pred_Theta <- learn_MHN(tree_obj, gamma)
  return(pred_Theta)
}

# Function to get a vector of non-selected (masked) elements 
get_mask <- function(n, SS_res, thr = 0.95) {
  to_mask <- sapply(SS_res, function (x) as.vector(x))
  to_mask[abs(to_mask) > 1e-2] <- 1
  to_mask[to_mask != 1] <- 0
  to_mask <- which(matrix(apply(to_mask,1,mean) > thr,nrow = n) == 0)
  return(to_mask)
}

It is recommended to run the stability selection procedure using the parallel package (runtime up to 10 min depending on the number of cores available).

SS_res <- mclapply(c(1:1000), 
                   function(i) subsample_once(subsample_size, tree_obj, gamma), 
                   mc.cores = detectCores())
TreeMHN_to_mask <- get_mask(n, SS_res, 0.99)
pred_Theta_w_SS <- learn_MHN(tree_obj, gamma = gamma, to_mask = TreeMHN_to_mask)

We first plot the true MHN and the two estimated MHNs by ordering the entries based on the true baseline rates. At a regularization level (\gamma = 0.5), most entries at the top left corner are recovered. Without stability selection, there are many false non-zero entries due to overfitting, including some very small values at the top left corner. With stability selection, we see that those entries are removed, along with some true positive entries at the lower left corner.

idx <- order(diag(true_Theta),decreasing = TRUE)
top_idx <- idx[c(1:(n/2))]

We can compute the precision and recall (= true positive rate) based on the off-diagonal differences using function compare_Theta. (See the Supplementary Material for more details.)

| (\Theta) (left) (\hat{\Theta}) (top) | (i \quad j) | (i \rightarrow j) | (i \dashv j) | | ---------------------------------------- | ------------- | ------------------- | -------------- | | (i \quad j) | TN | FP | FP | | (i \rightarrow j) | FN | TP | FP | | (i \dashv j) | FN | FP | TP |

Without stability selection:

compare_Theta(true_Theta, pred_Theta)

##       SHD        TP        FP        TN        FN Precision       TPR     FPR_N 
##     34.00     33.00     23.00     23.00     11.00      0.59      0.73      0.51 
##     FPR_P       MSE 
##      0.51      0.77

With stability selection:

compare_Theta(true_Theta, pred_Theta_w_SS)

##       SHD        TP        FP        TN        FN Precision       TPR     FPR_N 
##     30.00     16.00      1.00     44.00     29.00      0.94      0.36      0.02 
##     FPR_P       MSE 
##      0.02      0.89

If we focus on the first half of the events with higher baseline rates, we can see an increase in recall/TPR.

Without stability selection:

compare_Theta(true_Theta[top_idx,top_idx], pred_Theta[top_idx,top_idx])

##       SHD        TP        FP        TN        FN Precision       TPR     FPR_N 
##      7.00     10.00      7.00      3.00      0.00      0.59      1.00      0.70 
##     FPR_P       MSE 
##      0.70      0.11

With stability selection:

compare_Theta(true_Theta[top_idx,top_idx], pred_Theta_w_SS[top_idx,top_idx])

##       SHD        TP        FP        TN        FN Precision       TPR     FPR_N 
##      1.00      9.00      0.00     10.00      1.00      1.00      0.90      0.00 
##     FPR_P       MSE 
##      0.00      0.12

3 Real data

Here we use the tree dataset from Morita et al. (2020).

# note that we summarize the mutations at the gene level
load("./Data/AML_morita/AML_tree_obj.RData")
plot_tree_df(AML$tree_df[AML$tree_df$Tree_ID == match("AML-38_AML-38-001", AML$tree_labels),],
             AML$mutations, "AML-38_AML-38-001")

To use another dataset, please make sure it is in dataframe format with five columns:

Patient_ID: IDs of patients, unique for each patient;
Tree_ID: IDs of mutation trees, unique within each patient;
Node_ID: IDs of each node in the tree, including the root node (with ID “1”), unique for each node;
Mutation_ID: IDs of each mutational event. The root node has a mutation ID of “0”, and other mutation IDs can be duplicated in the tree to allow for parallel mutations;
Parent_ID: IDs of the parent node ID. The root node has itself as parent (ID “1”).

For example,

head(AML$tree_df)

##   Patient_ID Tree_ID Node_ID Mutation_ID Parent_ID
## 1          1       1       1           0         1
## 2          1       1       2           4         1
## 3          1       1       3           2         2
## 4          1       1       4           1         3
## 5          1       1       5           5         3
## 6          1       1       6           6         3

To convert a dataframe to an TreeMHN object, use the input_tree_df function. For example,

# not run
# input_tree_df(n = AML$n, tree_df = AML$tree_df, mutations = AML$mutations, tree_labels = AML$tree_labels)

To ensure enough precision, we run stability selection with (\gamma = 0.1) and a threshold of (95\%) and obtain a vector of non-selected elements over (1000) subsamples. Again, we recommend to run the code using the parallel package on a cluster (runtime up to 20 min depending on the number of cores available).

RNGkind("L'Ecuyer-CMRG")
gamma <- 0.1
subsample_size <- floor(AML$N / 2)
SS_res <- mclapply(c(1:1000), 
                   function(i) subsample_once(subsample_size, AML, gamma), 
                   mc.cores = detectCores(), 
                   mc.set.seed = TRUE)
to_mask <- get_mask(AML$n, SS_res, 0.95)

Then we refit the model by masking the non-selected elements.

AML_Theta <- learn_MHN(AML, gamma = gamma, to_mask = to_mask)

save(AML_Theta, file = "./Data/AML_morita/AML_Theta.RData")

Now we can plot the learned MHN. The diagonal entries represent the baseline rates of mutations to occur, independent of other events. The off-diagonal entries represent the interactions between mutations.

plot_Theta(AML_Theta, AML$mutations)

## TableGrob (1 x 2) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]

If we focus on the entries with non-zero off-diagonal entries, we get:

plot_Theta(AML_Theta, AML$mutations, full = FALSE)

## TableGrob (1 x 2) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]

The plot_observed_pathways function plots the observed trajectories sorted according to their relative frequencies. Given the estimated MHN, we can plot the most probable evolutionary trajectories using the plot_pathways_w_sampling function.

mutation_colors <- colors()[c(7, 11, 20, 143, 419,
                              417, 53, 62, 43, 76,
                              623, 80, 81, 542, 86,
                              93, 96, 524, 101, 102,
                              364, 367, 373, 383, 387,
                              399, 404, 405, 411, 481, 493)]
names(mutation_colors) <- AML$mutations

g1 <- plot_observed_pathways(AML, AML_Theta, mutation_colors = mutation_colors)
g2 <- plot_pathways_w_sampling(AML_Theta, AML$mutations, top_M = 40, mutation_colors = mutation_colors)
ggarrange(g1, g2)

Given a particular tree and the estimated MHN, we can also find the next most probable mutational events using the plot_next_mutations function.

tree_df <- AML$tree_df[AML$tree_df$Tree_ID == match("AML-09_AML-09-001", AML$tree_labels),]
plot_next_mutations(AML$n, 
                    tree_df,
                    AML_Theta,
                    AML$mutations,
                    "AML-09_AML-09-001",
                    8)

## Top 8 most probable mutational events that will happen next:
## The next most probable node: Root->NPM1->KRAS->NRAS 
## Probability: 16.461 %
## Log ratio model vs random: 2.949 
## The next most probable node: Root->NPM1->PTPN11 
## Probability: 10.353 %
## Log ratio model vs random: 2.486 
## The next most probable node: Root->NPM1->KRAS->PTPN11 
## Probability: 10.353 %
## Log ratio model vs random: 2.486 
## The next most probable node: Root->NPM1->KRAS->FLT3 
## Probability: 9.202 %
## Log ratio model vs random: 2.368 
## The next most probable node: Root->NPM1->NRAS 
## Probability: 8.693 %
## Log ratio model vs random: 2.311 
## The next most probable node: Root->NPM1->FLT3->WT1 
## Probability: 7.185 %
## Log ratio model vs random: 2.12 
## The next most probable node: Root->NPM1->FLT3->KRAS 
## Probability: 5.511 %
## Log ratio model vs random: 1.855 
## The next most probable node: Root->DNMT3A 
## Probability: 2.386 %
## Log ratio model vs random: 1.018

cbg-ethz/TreeMHN documentation built on Jan. 29, 2024, 1:29 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

cbg-ethz/TreeMHN
Joint inference of repeated evolutionary trajectories and patterns of clonal exclusivity from single-cell mutation trees

Example/Demo.md
In cbg-ethz/TreeMHN: Joint inference of repeated evolutionary trajectories and patterns of clonal exclusivity from single-cell mutation trees

1 Load required packages

2 Simulated data

2.1 Generate trees

2.2 Learn MHN from the generated trees

2.3 Performance assessment

3 Real data

3.1 Input dataset

3.2 Learn the MHN

3.3 Plot the network

R Package Documentation

Browse R Packages

We want your feedback!

cbg-ethz/TreeMHN Joint inference of repeated evolutionary trajectories and patterns of clonal exclusivity from single-cell mutation trees

Example/Demo.md In cbg-ethz/TreeMHN: Joint inference of repeated evolutionary trajectories and patterns of clonal exclusivity from single-cell mutation trees

1 Load required packages

2 Simulated data

2.1 Generate trees

2.2 Learn MHN from the generated trees

2.3 Performance assessment

3 Real data

3.1 Input dataset

3.2 Learn the MHN

3.3 Plot the network

R Package Documentation

Browse R Packages

We want your feedback!

cbg-ethz/TreeMHN
Joint inference of repeated evolutionary trajectories and patterns of clonal exclusivity from single-cell mutation trees

Example/Demo.md
In cbg-ethz/TreeMHN: Joint inference of repeated evolutionary trajectories and patterns of clonal exclusivity from single-cell mutation trees