Use the following parameter settings for measuring model performance on all parameters and latent variables:
N = 200
(Number of cells)M = 100
(Number of genes)K = 4
(Number of clusters)pi_k = (.2, .4, .15, .25)
(Mixing proportions)vb_max_iter = 300
(VB iterations)total_sims = 10
(Total simulations)gaussian_noise = 0.05
(Noise on the probability of success per CpG)gaussian_noise = 0.15
(For high noise experiments)Function for generating synthetic data: preprocessing/synthetic/encode_coverage.R
Parameters
cluster_var = 0.5
(% of regions that are dissimilar across clusters)data_train_prcg = 0.1
(% of data to keep fully for training)region_train_prcg = 1
(% of regions kept for training)cpg_train_prcg = seq(.1, .9, .1)
(% of CpGs kept for training in each region)We set a specific value for the cpg_train_prcg = 0.4
(supplementary figure is cpg_train_prcg = 0.8
) and change the dissimilarity percentage to show the robustness of the model across different dissimilarity levels.
Function for generating synthetic data: preprocessing/synthetic/encode_dissimilarity.R
Parameters
cluster_var = seq(0, 1, .1)
(% of regions that are dissimilar across clusters)data_train_prcg = 0.1
(% of data to keep fully for training)region_train_prcg = 1
(% of regions kept for training)cpg_train_prcg = 0.4
(% of CpGs kept for training in each region)From the above analysis we obtain the posterior probability of each cell belonging to each cluster and then we measure cluster performance using the Adjusted Rand Index (ARI).
On the same data run 10 simulations with K = 10 clusters and check when the model returns the actual 4 clusters that generated the data, file model-selection.R
. To obtain a better understanding on the model selection from the variational approximation we use a broad and strict prior over the weights i.e.
Broad (uninformative) prior
delta_0 <- rep(3, K)
(Dirichlet prior)alpha_0 <- 0.5
(Gamma prior)beta_0 <- 10
(Gamma prior)Strict (shrinkage) prior
delta_0 <- rep(1e-2, K)
(Dirichlet prior)alpha_0 <- 2
(Gamma prior)beta_0 <- 0.5
(Gamma prior)Run VB model and Gibbs model on synthetic data (file model-efficiency.R
) and show the efficiency of the VB model.
Parameters:
N = c(50, 100, 200, 500, 1000, 2000)
(Number of cells)M = 200
(Number of genes)K = 3
(Number of clusters)pi_k = (.3, .4, .3)
(Mixing proportions)vb_max_iter = 400
(VB iterations)gibbs_nsim = 3000
(VB iterations)total_sims = 5
(Total simulations)Data in datasets/ENCODE/scBS-seq/parsed/binarised
Scripts in datasets/ENCODE/scBS-seq/preprocessing
The same preprocessing steps as the real data. Use 40 H1-hESC (pseudo)-single cells and 40 GM12878 (pseudo)-single cells.
Filtering process: For 10kb
region keep regions with at least 10 CpGs coverage, and for 5kb
keep regions with at least 8 CpGs.
During imputation process 3 different training procedures: 20%, 50% and 80% CpGs used for training set. The remaining CpGs are used for test set.
Steps for pre-processing methylation data:
create_promoter.R
or create_region.R
.filter_met.R
function, which will keep regions with at least 10 CpG coverage and that have variability across cells. More specifically:cov = 10
met_sd = 0.2
rna_var = 5
cov = 10
met_sd = 0.2
rna_var = 5
cov = 15
met_sd = 0.2
rna_var = 5
cov = 10
met_sd = 0.2
region_annot > 1000bp
cov = 10
met_sd = 0.2
region_annot > 1000bp
cov = 10
met_sd = 0.2
region_annot > 1000bp
Finally, we kept only regions that had 10 CpGs coverage across 50% of the cells, so we could test the assumption of information sharing, and also have an adequate amount of information when inferring methylation states.
For imputation we use the following parameters settings
K = 6
(Number of clusters)filt_region_cov = 0.5
(Filter genes that have low coverage across cells)data_train_prcg = 0.4
(% of data that will be used fully for training set)region_train_prcg = 0.95
(% of samples to keep for training, i.e. some genes will have no coverage for a given cell.)cpg_train_prcg = 0.5
(% of CpGs to keep for training at each region/cell)total_sims = 10
(Total simulations)Run VB model on MT-seq data (file model-efficiency/mt-seq/model-efficiency.R
) and show the efficiency of the VB model for all contexts.
Parameters: The same as the ones when running the model for clustering and imputation.
suppressPackageStartupMessages(library(kableExtra)) suppressPackageStartupMessages(library(data.table)) suppressPackageStartupMessages(library(purrr)) dt <- data.table::data.table("Context" = c("Prom3k", "Prom5k", "Prom10k", "Active Enhancers", "Nanog", "Super Enhancers"), "CpGs (in millions)" = c("0.62", "2.1", "6", "0.7", "0.18", "0.5"), "Time" = c("09:58 - 11:17", "10:38 - 13:36", "09:57 - 15:34", "09:52 - 11:38", "09:54 - 10:31", "09:53 - 10:54"), "Time elapsed (in hours)" = c("1.31", "2.9", "5.6", "1.76", "0.61", "1")) dt %>% kable("html") %>% kable_styling()
Steps for pre-processing methylation data:
create_region.R
.filter_met.R
function, which will keep regions with at least N CpG coverage and that have variability across cells. More specifically:cov = 10
met_sd = 0.2
cov = 15
met_sd = 0.2
cov = 20
met_sd = 0.2
cov = 10
met_sd = 0.2
region_annot > 1000bp
cov = 10
met_sd = 0.2
region_annot > 1000bp
cov = 10
met_sd = 0.2
region_annot > 1000bp
Finally, we kept only regions that had N CpGs coverage across 50% of the cells, so we could test the assumption of information sharing, and also have an adequate amount of information when inferring methylation states.
For imputation we use the following parameters settings
K = 5
(Number of clusters)filt_region_cov = 0.5
(Filter genes that have low coverage across cells)data_train_prcg = 0.4
(% of data that will be used fully for training set)region_train_prcg = 0.95
(% of samples to keep for training, i.e. some genes will have no coverage for a given cell.)cpg_train_prcg = 0.5
(% of CpGs to keep for training at each region/cell)total_sims = 10
(Total simulations)Run VB model on MT-seq data (file model-efficiency/mt-seq/model-efficiency.R
) and show the efficiency of the VB model for all contexts.
Parameters: The same as the ones when running the model for clustering and imputation.
suppressPackageStartupMessages(library(kableExtra)) suppressPackageStartupMessages(library(data.table)) suppressPackageStartupMessages(library(purrr)) dt <- data.table::data.table("Context" = c("Prom3k", "Prom5k", "Prom10k", "Active Enhancers", "Nanog", "Super Enhancers"), "CpGs (in millions)" = c("0.98", "1.54", "4.13", "0.85", "0.26", "0.29"), "Time" = c("08:32 - 10:41", "08:31 - 10:44", "11:19 - 15:24", "08:33 - 10:30", "08:35 - 09:30", "08:34 - 09:28"), "Time elapsed (in hours)" = c("1.83", "2.21", "4", "2", "0.91", "0.9")) dt %>% kable("html") %>% kable_styling()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.