The zingeR package provides a framework to simulate scRNA-seq data. The user can input a real scRNA-seq dataset to extract feature-level parameters for generating scRNA-seq expression counts. The `getDatasetZTNB` function will estimate empirical mean expression proportions for the positive counts $\hat \lambda_g = \frac{1}{|\{i : y_{gi} > 0\}|} \sum_{i \in \{i : y_{gi} > 0\}} y_{gi}/N_i$, with $|x|$ denoting the cardinality of the set $x$, $y_{gi}$ the expression count for gene $g$ in sample $i$ and $N_i$ the library size (sequencing depth) of sample $i$. The function also estimates dispersions $\phi_g$ based on the zero-truncated negative binomial (ZTNB) distribution. Additionally, the function derives empirical probabilities on zero counts $p_{gi}$. Aside from these feature-specific parameters, `getDatasetZTNB` also estimates the global association of $p_{gi}$ with (a smooth function of) the average log CPM $A_g$ of a gene and the library size $N_i$.
```r
suppressPackageStartupMessages({
  library(zingeR)
  library(Biobase)
  library(gamlss)
  library(gamlss.tr)
  library(edgeR)
})
data(islamEset, package="zingeR")
islamHlp <- exprs(islamEset)[9:2008,] # first 8 rows are spike-ins
cellType <- pData(islamEset)[,"cellType"]
paramsIslam <- getDatasetZTNB(counts=islamHlp, design=model.matrix(~cellType))
```
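To make the $\hat \lambda_g$ formula concrete, it can be computed by hand for a single gene. This is only a minimal sketch, assuming the library size $N_i$ is the column sum of the count matrix; `getDatasetZTNB` performs this for all genes at once, together with the ZTNB dispersion and zero-probability estimation.

```r
# Sketch of the empirical mean expression proportion for one gene:
# the average of y_gi / N_i over the cells with a positive count.
N   <- colSums(islamHlp)   # library sizes N_i (assumed: column sums)
y   <- islamHlp[1, ]       # counts y_gi for one gene g
pos <- y > 0               # cells where the gene is expressed
lambdaHat <- mean(y[pos] / N[pos])
```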
Next we use the parameters to simulate the expression counts. Library sizes for the simulated samples are by default resampled from the input dataset, but they can also be specified by the user. The simulation framework models positive and zero counts separately, similar to a hurdle model. The zero probability $p_{gi}$ of a gene $g$ is modelled as a function of its expression intensity (in terms of average log counts per million $A_g$) and the sequencing depth $N_i$ of sample $i$ using a semiparametric additive logistic regression model. The simulation paradigm jointly samples the gene-wise estimated parameters $\{\hat \lambda_g, \hat \phi_g, A_g, p_{gi}\}$ to retain the gene-specific characteristics present in the original dataset. Positive counts are then simulated according to a ZTNB distribution with mean $\hat \mu_{gi} = \hat \lambda_g N_i$ and dispersion $\hat \phi_g$. We use the expected probability on a zero count, $p_g = \frac{\sum_{i=1}^n p_{gi}}{n}$, to introduce zero counts by simulating from a binomial process. The simulation thus acknowledges both gene-specific characteristics and broad associations across all genes, and provides realistic scRNA-seq data.
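To make the two-step scheme concrete, here is a minimal sketch for a single gene using hypothetical parameter values; the actual implementation, described next, jointly samples the estimated parameters across genes and handles fold changes, library sizes and dispersion bounds.

```r
# Sketch of the hurdle-style sampling for one gene across 10 cells,
# with hypothetical parameter values.
lambda_g <- 1e-5         # mean expression proportion (hypothetical)
phi_g    <- 0.5          # ZTNB dispersion (hypothetical)
p_g      <- 0.3          # expected zero probability (hypothetical)
N        <- rep(5e5, 10) # library sizes
mu       <- lambda_g * N # ZTNB means

# draw positive counts by rejecting zeros from NB draws (simple ZTNB sampler)
ztnb <- sapply(mu, function(m) {
  repeat {
    y <- rnbinom(1, mu = m, size = 1/phi_g)
    if (y > 0) return(y)
  }
})

# introduce zeros through a binomial process with probability p_g
counts1gene <- ztnb * rbinom(length(ztnb), size = 1, prob = 1 - p_g)
```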
This is implemented in the `NBsimSingleCell` function. The simulation settings can be specified with the arguments of the function:
- `dataset`: an expression matrix of counts representing the dataset on which the simulation is based.
- `group`: group indicator specifying the attribution of the samples to the different conditions of interest being simulated.
- `nTags`: the number of features (genes) to simulate. $1000$ by default.
- `nlibs`: the number of samples to simulate. Defaults to `length(group)`.
- `lib.size`: the library sizes for the simulated samples. If `NULL` (default), library sizes are resampled from the original dataset.
- `pUp`: numeric value between $0$ and $1$ ($0.5$ by default) specifying the proportion of differentially expressed genes that are upregulated in the second condition.
- `foldDiff`: the fold changes used in simulating the differentially expressed genes. Either a single numeric value specifying the same fold change for all DE genes, or a vector of the same length as `ind` specifying a fold change for every differentially expressed gene. Note that only fold changes strictly higher than $1$ should be used as input; a fraction of them (defined by `pUp`) will be inverted to simulate downregulation in the second condition. Defaults to $3$.
- `verbose`: logical. Should progress be printed?
- `ind`: integer vector specifying the rows of the simulated count matrix that contain differential features.
- `params`: an object containing feature-wise parameters used for simulation, typically the output from `getDatasetZTNB`. If `NULL`, parameters are estimated from the dataset provided.
- `randomZero`: a numeric value between $0$ and $1$ specifying the random fraction of expression counts that are set to zero after simulating the expression count matrix. Defaults to $0$.
- `min.dispersion`: the minimum dispersion value to use for simulation. $0.1$ by default.
- `max.dispersion`: the maximum dispersion value to use for simulation. $400$ by default.
- `drop.extreme.dispersion`: only applicable if `params=NULL`, in which case it is passed to `getDatasetZTNB`. Numeric value between $0$ and $1$ specifying the fraction of highest dispersion values to remove after estimating the feature-wise parameters.

```r
nSamples <- 80
grp <- as.factor(rep(0:1, each=nSamples/2)) # two-group comparison
nTags <- 2000 # number of features
set.seed(436)
DEind <- sample(1:nTags, floor(nTags*.1), replace=FALSE) # 10% differentially expressed
fcSim <- (2 + rexp(length(DEind), rate=1/2)) # fold changes
libSizes <- sample(colSums(islamHlp), nSamples, replace=TRUE) # library sizes
simDataIslam <- NBsimSingleCell(foldDiff=fcSim, ind=DEind, dataset=islamHlp,
                                nTags=nTags, group=grp, verbose=TRUE,
                                params=paramsIslam, lib.size=libSizes)
simDataIslam$counts[1:5, 1:5]
#head(simDataIslam$counts)

# BCV plots: compare dispersions of the original and simulated data
dOrig <- suppressWarnings(edgeR::calcNormFactors(DGEList(islamHlp)))
dOrig <- estimateGLMTagwiseDisp(estimateGLMCommonDisp(dOrig,
           design=model.matrix(~cellType), interval=c(0,10)), prior.df=0)
d <- suppressWarnings(edgeR::calcNormFactors(DGEList(simDataIslam$counts)))
d <- estimateGLMTagwiseDisp(estimateGLMCommonDisp(d,
           design=model.matrix(~grp), interval=c(0,10)), prior.df=0)
par(mfrow=c(1,2))
plotBCV(dOrig, ylim=c(0,13), xlim=c(4,16))
plotBCV(d, ylim=c(0,13), xlim=c(4,16))
par(mfrow=c(1,1))

# association of library size with zeros
plot(x=log(colSums(islamHlp)), y=colMeans(islamHlp==0),
     xlab="Log library size", ylab="Fraction of zeros", xlim=c(5.5,13))
points(x=log(colSums(simDataIslam$counts)),
       y=colMeans(simDataIslam$counts==0), col=2)

# association of aveLogCPM with zeros
library(scales)
plot(x=edgeR::aveLogCPM(islamHlp), y=rowMeans(islamHlp==0),
     xlab="Average log CPM", ylab="Fraction of zeros",
     ylim=c(0,1), col=alpha(1,1/2), pch=19, cex=.3)
points(x=edgeR::aveLogCPM(simDataIslam$counts),
       y=rowMeans(simDataIslam$counts==0), col=alpha(2,1/2), pch=19, cex=.3)
```
We first perform a differential expression analysis using zingeR posterior probabilities, focussing on the count component of the ZINB model. We use edgeR to model the count component.

The weights are estimated with the core function of zingeR, `zeroWeightsLS`. It is important to be consistent with the normalization procedure, i.e. if you use TMM normalization for the analysis, you should also use it for estimating the zingeR posterior probabilities. In a later section, we demonstrate a zero-inflated NB analysis using DESeq2, where we re-estimate the weights using the proper DESeq2 normalization. The user can also input normalization factors from any normalization procedure of interest through the `normFactors` argument in `zeroWeightsLS`, as sketched below.
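For example, a hedged sketch of supplying externally computed factors, here TMM factors from edgeR (this assumes `normFactors` accepts a vector with one factor per sample, which is not verified here):

```r
# Sketch: supplying custom normalization factors to zeroWeightsLS.
# Any normalization procedure can be used; TMM is shown for illustration.
design <- model.matrix(~grp)
nf <- edgeR::calcNormFactors(simDataIslam$counts)  # TMM factors, one per sample
weightsCustom <- zeroWeightsLS(counts=simDataIslam$counts, design=design,
                               maxit=200, normFactors=nf)
```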
The weights can then be provided in the `DGEList` object, which will be used for dispersion estimation in the native edgeR `estimateDisp` function.
After estimation of the dispersions and zingeR posterior probabilities, the `glmWeightedF` function is used for statistical inference. This function is adapted from the `glmLRT` function of edgeR. It uses an F-test for which the denominator degrees of freedom are by default adjusted according to the downweighting of excess zeros (argument `ZI=TRUE`). In addition, independent filtering can be performed on the obtained p-values (argument `independentFiltering=TRUE`). We use the independent filtering strategy that was originally implemented in DESeq2. By default, the average fitted values are used as the filter criterion.
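The degrees-of-freedom adjustment can be illustrated with a toy example following the formula $df_g = (\sum_{i=1}^n w_{gi}) - p$ given for DESeq2 below; the actual adjustment is handled internally by `glmWeightedF`, so this is a sketch only:

```r
# Sketch: per-gene denominator df when excess zeros are downweighted.
# With all weights equal to 1 this reduces to the usual n - p.
wToy <- rbind(c(1,   1,   1, 1),  # no downweighting: df = 4 - p
              c(1,   0.2, 1, 1),  # one heavily downweighted zero
              c(0.5, 0.5, 1, 1))
p <- 2                 # e.g. intercept + group effect
rowSums(wToy) - p      # per-gene denominator df: 2, 1.2, 1
```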
```r
# zingeR_edgeR analysis
library(edgeR)
d <- DGEList(simDataIslam$counts)
d <- suppressWarnings(calcNormFactors(d))
design <- model.matrix(~grp)
weights <- zeroWeightsLS(counts=d$counts, design=design, maxit=200,
                         normalization="TMM")
d$weights <- weights
d <- estimateDisp(d, design)
plotBCV(d)
fit <- glmFit(d, design)
lrt <- glmWeightedF(fit, coef=2, independentFiltering=TRUE)
hist(lrt$table$PValue)
sum(lrt$table$padjFilter <= 0.05, na.rm=TRUE)
mean(simDataIslam$indDE %in% which(lrt$table$padjFilter <= 0.05))    # TPR
mean(which(lrt$table$padjFilter <= 0.05) %in% simDataIslam$indNonDE) # FDP
```
For the zingeR_DESeq2 analysis we re-estimate the weights using the positive-counts normalization recently implemented in DESeq2 and first developed as part of the `phyloseq` Bioconductor package. The weights should then be provided to the DESeq object using the `assays` accessor function, after which a conventional DESeq2 analysis follows. Note that we allow for default shrinkage of the fold changes by specifying `betaPrior=TRUE`. We adjust for downweighting of excess zeros by using the t-distribution as the null distribution (argument `useT=TRUE`) with degrees of freedom for gene $g$, $df_g = (\sum_{i=1}^n w_{gi}) - p$, with $w_{gi}$ the weight for gene $g$ in sample $i$ and $p$ the number of coefficients, i.e. the number of columns of the design matrix. The adjusted degrees of freedom are specified with the `df` argument of the `nbinomWaldTest` function in DESeq2.
```r
# zingeR_DESeq2 analysis
library(DESeq2)
colData <- data.frame(grp=grp)
design <- model.matrix(~grp)
dse <- DESeqDataSetFromMatrix(countData=simDataIslam$counts, colData=colData,
                              design=~grp)
weights <- zeroWeightsLS(counts=simDataIslam$counts, design=design, maxit=200,
                         normalization="DESeq2_poscounts", colData=colData,
                         designFormula=~grp)
assays(dse)[["weights"]] <- weights
dse <- DESeq2::estimateSizeFactors(dse, type="poscounts")
dse <- estimateDispersions(dse)
dse <- nbinomWaldTest(dse, betaPrior=TRUE, useT=TRUE, df=rowSums(weights)-2)
res <- results(dse)
sum(res$padj <= 0.05, na.rm=TRUE)
mean(simDataIslam$indDE %in% which(res$padj <= 0.05))    # TPR
mean(which(res$padj <= 0.05) %in% simDataIslam$indNonDE) # FDP
```
Aside from using edgeR or DESeq2 for the negative binomial count component of the zero-inflated negative binomial model, the user can use any method of choice by incorporating the weights as estimated with `zeroWeightsLS`, as sketched below.
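As a hedged illustration of plugging the weights into a different tool, one could fit an observation-weighted NB GLM per gene with `MASS::glm.nb`. This is a minimal sketch only: normalization offsets are omitted, and `glm.nb` may warn when combining non-integer prior weights with count responses.

```r
# Sketch: using zingeR weights with a method of choice, here an
# observation-weighted NB GLM for a single gene via MASS::glm.nb.
# Illustrative only: no normalization offset is included.
library(MASS)
g <- 1
yg <- simDataIslam$counts[g, ]
fitg <- glm.nb(yg ~ grp, weights = weights[g, ])
summary(fitg)$coefficients
```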