# To render an HTML version that works nicely with github and web pages, do: # rmarkdown::render("vignettes/escoter.Rmd", "all") knitr::opts_chunk$set(fig.align = 'center', fig.width = 10, fig.height = 8, dev = 'png') # Use exact BSPARAM to avoid warnings options(BiocSingularParam.default=BiocSingular::ExactParam()) library(ESCO) showdir = "./show/" if(!dir.exists(showdir))dir.create(showdir)
Welcome to ESCO
! ESCO
is an R package for the simple simulation of
single-cell RNA sequencing data with special focus on gene co-expression, built using the infrastructure of the splatter
package. This vignette gives an overview and introduction to ESCO's functionality.
ESCO
can be downloaded from Github: https://github.com/JINJINT/ESCO
We use the term 'ESCO' to refer to the newly proposed Ensemble single cell simulator with gene co-expression, and differentiate it from the package SplatterESCO itself. As for the marginal behaviour of genes and cells, the core of the ESCO model is foundimentally the same with the original Splat: they both use a gamma-Poisson distribution with a enforced mean-variance trend to generate gene counts, and use a log-normal/normal distribution to give an expected library size for each cell, while outlier genes (genes with mean expression outside the gamma distribution) and dropouts (random knock out of counts based on mean expression) are introduced along the way.
ESCO can also simulate differential expression between groups of different types of cells or differentiation paths between different cells types where expression changes in a continuous way. These are described further in the [ESCO simulation] section.
Once we have a set of parameters we are happy with, we can use escoSimulate
to simulate counts. If we want to make small adjustments to the parameters we
can provide them as additional arguments, alternatively if we don't supply any
parameters the defaults will be used:
sim <- escoSimulate(nGenes = 100, nCells = 50) sim # Access the counts data = assays(sim)$TrueCounts data[1:5, 1:5] # Information about genes head(rowData(sim)) # Information about cells head(colData(sim)) # Information about paramters str(metadata(sim)$Params)
Looking at the output of escoSimulate
we can see that sim
is
SingleCellExperiment
object contains phenotype information about
each cell (accessed using colData
); feature information about each gene (accessed using rowData
); genes by cells data matrix and intermediate values of the simulation (accessed using assays
); and parameters used to generate this simulation (accessed using metadata
).
The main part of this object is a series of features
by samples matrix containing the simulated true counts (accessed using assays(sim)$TrueCounts
); counts with zero-inflation noise; (accessed using assays(sim)$counts
); counts with downsample noise ((accessed using assays(sim)$observedcounts
). By default, all three kinds of data matrix will be simulated, while users are allowed to simulated only one or two of them for time/space saving, via specifying the dropout.type
in parameters setting.
For more details about the SingleCellExperiment
object refer to the vignette. For information about what you can do with scater
refer to the scater
documentation and vignette.
Additionally, in order to have better reproducibility, ESCO also allows multiple trials of noisy data generated from one common true counts in one single call, and save them seperately in the directory user provided. The returned sim
on the other hand will only contain one trial of noisy data in order to avoid memory exhuastion.
if(!dir.exists(paste0(showdir,"case1"))) dir.create(paste0(showdir,"case1")) sim <- escoSimulate(nGenes = 100, nCells = 50, trials = 2, dirname = paste0(showdir,"case1/"), verbose = FALSE) list.files(path = paste0(showdir,"case1/"), full.names = FALSE)
Moreover, ESCO also allows noisy data of different configuration of noise level data generated from one common true counts in one single call, and save them seperately in the directory user provided. The returned sim
on the other hand will only contain one configuration of noisy data in order to avoid memory exhuastion.
if(!dir.exists(paste0(showdir,"case2")))dir.create(paste0(showdir,"case2")) sim <- escoSimulate(nGenes = 100, nCells = 50, dropout.mid = c(1,2), alpha_mean = c(0.1, 0.2), trials = 2, dirname = paste0(showdir,"case2/"), verbose = FALSE) list.files(path = paste0(showdir,"case2/"), full.names = FALSE)
In this section, we will show specific examples of how to simulate one homogenrous cell with/ without gene co-expression group using ESCO. Fisrtly, we show exmaple of simulation without gene co-expression, where heatdata
and heatgcn
are newly added functions for convinient visualization.
sim <- escoSimulateSingle(nGenes = 100, nCells = 50, lib.loc = 7, withcorr = TRUE, verbose = FALSE) # get the data datalist = list("simulated truth" = assays(sim)$TrueCounts, "zero-inflated" = assays(sim)$counts, "down-sampled" = assays(sim)$observedcounts) # plotting the data heatdata(datalist, norm = FALSE, size = 2, ncol = 3)
# plotting the GCN simparams = metadata(sim)$Params rholist = slot(simparams,"corr") corrgenes = rownames(rholist[[1]]) gcnlist = lapply(datalist, function(data)gcn(data, genes = corrgenes)) gcnlist = append(gcnlist, list("given truth" = rholist[[1]]), 0) heatgcn(gcnlist, size = 3, ncol = 4)
You can also provide your own defined GCN. In the following, we take the randomly generated correlation structure in the last simulation as our input.
sim <- escoSimulateSingle(nGenes = 100, nCells = 50, lib.loc = 7, withcorr = TRUE, corr = rholist, verbose = FALSE) # plotting the data datalist = list("simulated truth" = assays(sim)$TrueCounts, "zero-inflated" = assays(sim)$counts, "down-sampled" = assays(sim)$observedcounts) heatdata(datalist, norm = FALSE, size = 2, ncol = 3)
# plotting the GCN simparams = metadata(sim)$Params rholist = slot(simparams,"corr") corrgenes = rownames(rholist[[1]]) gcnlist = lapply(datalist, function(data)gcn(data, genes = corrgenes)) gcnlist = append(gcnlist, list("given truth" = rholist[[1]]), 0) heatgcn(gcnlist, size = 2, ncol = 4)
So far we have only simulated a single population of cells but often we are
interested in investigating a mixed population of cells and looking to see what
cell types are present or what differences there are between them. ESCO is
able to simulate these situations by changing the method
argument. Here we are
going to simulate two groups, by specifying the group.prob
, de.prob
parameters. The gene-gene correlation can be generated randomly (by default), or given by users.
sim<-escoSimulateGroups(nGenes = 200, nCells = 100, group.prob = c(0.6, 0.4), deall.prob = 0.3, de.prob = c(0.3, 0.7), de.facLoc = c(1.9, 2.5), withcorr = TRUE, trials = 1, verbose =FALSE) # organize the marker gene info genegroup = paste0("Group", rowData(sim)$GeneGroup) genegroup[which(genegroup=="Group0")] = "None" geneinfo = data.frame(genes = rowData(sim)$Gene, newcelltype = as.factor(genegroup)) # organize the cell info cellinfo = data.frame(cells = colData(sim)$Cell, newcelltype=as.factor(colData(sim)$Group)) # get the data datalist = list("simulated truth" = assays(sim)$TrueCounts, "zero-inflated" = assays(sim)$counts, "down-sampled" = assays(sim)$observedcounts) # plotting the data heatdata(datalist, cellinfo = cellinfo, geneinfo = geneinfo, size = 1, ncol = 3)
# plot GCN for all marker genes (i.e. DE genes) across all cell groups degeneinfo = geneinfo[which(geneinfo$newcelltype!="None"),] degeneinfo$newcelltype = droplevels(degeneinfo$newcelltype) degcnlist = lapply(datalist, function(data)gcn(data, genes = degeneinfo$genes)) heatgcn(degcnlist, geneinfo = degeneinfo, size = 2, ncol = 3)
# plot GCN for marker gene within one cell group simparams = metadata(sim)$Params rholist = slot(simparams,"corr") group2_gcnlist = lapply(datalist, function(data){ gcn(data[,which(colData(sim)$Group=="Group2")], CPM2 = TRUE, genes = rownames(rholist[["Group2"]]))}) group2_gcnlist = append(group2_gcnlist, list("given truth" = rholist[["Group2"]]), 0) heatgcn(group2_gcnlist, size = 3, ncol = 4)
#generate the tree yaml=" name: All Group1: Group1-1: params: NULL Group2: Group2-1: params: NULL Group2-2: params: NULL " os.list = yaml::yaml.load(yaml) tree = data.tree::as.Node(os.list) tree = data.tree::as.phylo.Node(tree)
# simulation sim<-escoSimulateTree(nGenes = 500, nCells = 300, tree = list(tree), group.prob = c(0.4, 0.3, 0.3), deall.prob = 0.3, de.center = 2, de.prob = c(0.5, 0.5, 0.5), de.facLoc = c(1.5, 2, 3), de.facScale = c(0.9, 0.9, 0.9), withcorr = TRUE, trials = 1, verbose = FALSE) # get the data datatrue = assays(sim)$TrueCounts # get the cellinfo cellinfo = data.frame(cell = colData(sim)$Cell, newcelltype = as.factor(colData(sim)$Group)) levels(cellinfo$newcelltype) = tree$tip.label # get the geneinfo genegroup = paste0("Group", rowData(sim)$GeneGroup) genegroup[which(genegroup=="Group0")] = "None" geneinfo = data.frame(genes = rowData(sim)$Gene, newcelltype = as.factor(genegroup)) levels(geneinfo$newcelltype)[1:3] = tree$tip.label # get the DE geneinfo groups <- colData(sim)$Group group.names <- sort(unique(groups)) group.facs.gene <- rowData(sim)[, paste0("DEFac", group.names)] DEgene.name = as.character(rowData(sim)$Gene[which(group.facs.gene[,1]>1)]) degeneinfo = geneinfo[match(DEgene.name, geneinfo$genes),] # plot the data heatdata(list(datatrue), colv = TRUE, cellinfo = cellinfo, geneinfo = degeneinfo, genes = degeneinfo$genes, size = 1.5, ncol = 1)
The other situation that is often of interest is a differentiation process where
one cell type is changing into another. escoter approximates this process by
simulating a series of steps between two groups and randomly assigning each
cell to a step. We can create this kind of simulation using the "paths"
method.
sim <- escoSimulateTraj(nGenes = 1000, nCells = 500, paths.deprob = 0.3, paths.design = data.frame( Path = c(1, 2, 3), From = c(0, 1, 2), Steps = c(100, 100, 100) ), cells.design = data.frame( Path = c(1, 2, 3), Probability = c(0.3, 0.3, 0.4), Alpha = 1, Beta = 1 ), withcorr = TRUE, verbose = FALSE) datatrue = assays(sim)$TrueCounts # get the cellinfo cellinfo = data.frame(cell = colData(sim)$Cell, newcelltype = colData(sim)$Path) # get the pesudo time celltime = data.frame(path = as.numeric(colData(sim)$Path), step = as.numeric(colData(sim)$Step)) celltime = order(celltime[,1], celltime[,2]) # get the geneinfo params = metadata(sim)$Params pathde = slot(params, "paths.DEgenes") degenes = which(pathde==1) # plot the data heatdata(list("simulated truth" = datatrue[degenes,]), cellinfo = cellinfo, colv = celltime, size = 1, ncol = 1)
unlink("./show", recursive=TRUE) sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.