Simulating scRNA-seq data

The zingeR package provides a framework to simulate scRNA-seq data. The user can input a real scRNA-seq dataset from which feature-level parameters are extracted for generating scRNA-seq expression counts. The getDatasetZTNB function estimates empirical mean expression proportions for the positive counts, $\hat \lambda_g = \frac{1}{|\{i: y_{gi} > 0\}|} \sum_{i \in \{i: y_{gi} > 0\}} y_{gi}/N_i$, with $|x|$ denoting the cardinality of the set $x$, $y_{gi}$ the expression count for gene $g$ in sample $i$ and $N_i$ the library size (sequencing depth) of sample $i$. The function also estimates gene-wise dispersions $\phi_g$ based on the zero-truncated negative binomial (ZTNB) distribution, and derives empirical probabilities of zero counts $p_{gi}$. Aside from these feature-specific parameters, getDatasetZTNB also estimates the global association of $p_{gi}$ with (a smooth function of) the average log CPM $A_g$ of a gene and the library size $N_i$.

suppressPackageStartupMessages({
library(zingeR)
library(Biobase)
library(gamlss)
library(gamlss.tr)
library(edgeR)
})
data(islamEset, package="zingeR") # example scRNA-seq dataset shipped with zingeR
islamHlp=exprs(islamEset)[9:2008,] # drop the first 8 features, which are spike-ins
cellType=pData(islamEset)[,"cellType"]
paramsIslam=getDatasetZTNB(counts=islamHlp, design=model.matrix(~cellType))
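
As a quick sanity check, the empirical mean expression proportions defined above can also be computed directly from the data. The snippet below is a minimal sketch of the formula, not the getDatasetZTNB implementation (note that genes with only zero counts would yield NaN here):

# direct computation of lambda-hat: mean of y_gi/N_i over the positive counts
N <- colSums(islamHlp) # library sizes
lambdaHat <- apply(islamHlp, 1, function(y) mean((y/N)[y > 0]))
head(lambdaHat)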

Next we use the parameters to simulate the expression counts. Library sizes for the simulated samples are by default resampled from the input dataset, but they can also be specified by the user. The simulation framework models positive and zero counts separately, similar to a hurdle model. The zero probability $p_{gi}$ of a gene $g$ is modelled as a function of its expression intensity (in terms of average log counts per million $A_g$) and the sequencing depth $N_i$ of sample $i$ using a semiparametric additive logistic regression model. The simulation paradigm jointly samples the gene-wise estimated parameters $\{\hat \lambda_g, \hat \phi_g, A_g, p_{gi}\}$ to retain the gene-specific characteristics present in the original dataset. Positive counts are then simulated according to a ZTNB distribution with mean $\hat \mu_{gi} = \hat \lambda_g N_i$ and dispersion $\hat \phi_g$. We use the expected probability of zero counts $p_g = \frac{\sum_{i=1}^n p_{gi}}{n}$ to introduce zero counts by simulating from a binomial process. The simulation thus acknowledges both gene-specific characteristics and broad associations across all genes, and provides realistic scRNA-seq data.
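
To make the hurdle construction concrete, below is a conceptual sketch of the sampling scheme for a single gene, assuming its parameters $\lambda$, $\phi$ and $p_g$ and the library sizes $N$ are given. This is an illustration only, not the actual zingeR implementation, and the rejection step for the ZTNB draw is kept deliberately simple:

# conceptual sketch of the hurdle simulation for one gene
simulateGeneHurdle <- function(lambda, phi, p0, N) {
  mu <- lambda * N # expected counts scale with sequencing depth
  # draw from a ZTNB by redrawing zeros (simple rejection sampling)
  y <- rnbinom(length(N), mu=mu, size=1/phi)
  while (any(y == 0)) {
    id <- which(y == 0)
    y[id] <- rnbinom(length(id), mu=mu[id], size=1/phi)
  }
  # introduce zeros with the expected zero probability p0 (binomial process)
  y * rbinom(length(N), size=1, prob=1-p0)
}
yg <- simulateGeneHurdle(lambda=2e-5, phi=0.5, p0=0.3, N=rep(5e5, 10))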

The full simulation is implemented in the NBsimSingleCell function; the simulation settings can be specified through its arguments:

nSamples=80
grp=as.factor(rep(0:1, each = nSamples/2)) #two-group comparison
nTags=2000 #nr of features
set.seed(436)
DEind = sample(1:nTags,floor(nTags*.1),replace=FALSE) #10% differentially expressed
fcSim=(2 + rexp(length(DEind), rate = 1/2)) #fold changes
libSizes=sample(colSums(islamHlp),nSamples,replace=TRUE) #library sizes
simDataIslam <- NBsimSingleCell(foldDiff=fcSim, ind=DEind, dataset=islamHlp, nTags=nTags, group=grp, verbose=TRUE, params=paramsIslam, lib.size=libSizes)
simDataIslam$counts[1:5,1:5]

# BCV plots: the simulated data should reproduce the mean-dispersion trend of the original data
# original data
dOrig=suppressWarnings(edgeR::calcNormFactors(DGEList(islamHlp)))
dOrig=estimateGLMTagwiseDisp(estimateGLMCommonDisp(dOrig, design=model.matrix(~cellType), interval=c(0,10)), prior.df=0)

# simulated data
d=suppressWarnings(edgeR::calcNormFactors(DGEList(simDataIslam$counts)))
d=estimateGLMTagwiseDisp(estimateGLMCommonDisp(d, design=model.matrix(~grp), interval=c(0,10)), prior.df=0)

par(mfrow=c(1,2))
plotBCV(dOrig, ylim=c(0,13), xlim=c(4,16), main="Original data")
plotBCV(d, ylim=c(0,13), xlim=c(4,16), main="Simulated data")
par(mfrow=c(1,1))

# association of library size with the fraction of zeros: real (black) vs. simulated (red)
plot(x=log(colSums(islamHlp)), y=colMeans(islamHlp==0), xlab="Log library size", ylab="Fraction of zeros", xlim=c(5.5,13))
points(x=log(colSums(simDataIslam$counts)), y=colMeans(simDataIslam$counts==0), col=2)
legend("topright", c("real","simulated"), col=1:2, pch=1)

# association of average log CPM with the fraction of zeros: real (black) vs. simulated (red)
library(scales)
plot(x=edgeR::aveLogCPM(islamHlp), y=rowMeans(islamHlp==0), xlab="Average log CPM", ylab="Fraction of zeros", ylim=c(0,1), col=alpha(1,1/2), pch=19, cex=.3)
points(x=edgeR::aveLogCPM(simDataIslam$counts), y=rowMeans(simDataIslam$counts==0), col=alpha(2,1/2), pch=19, cex=.3)
legend("topright", c("real","simulated"), col=1:2, pch=19)

Differential expression analysis

zingeR-edgeR

We first perform a differential expression analysis using the zingeR posterior probabilities, focusing on the count component of the zero-inflated negative binomial (ZINB) model. We use edgeR to model the count component.

The weights are estimated with the core function of zingeR, zeroWeightsLS. It is important to be consistent with the normalization procedure: if you use TMM normalization for the analysis, you should also use it for estimating the zingeR posterior probabilities. In the next section, we demonstrate a zero-inflated NB analysis using DESeq2, where we re-estimate the weights using the proper DESeq2 normalization. The user can also input normalization factors from any normalization procedure of interest through the normFactors argument of zeroWeightsLS, as sketched below.
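
For example, externally computed normalization factors could be passed along as follows. This is a hedged sketch: we assume here that normFactors expects sample-level normalization factors such as those returned by edgeR's calcNormFactors; consult ?zeroWeightsLS for the expected format.

# hypothetical example: supply externally computed normalization factors
design <- model.matrix(~grp)
nf <- edgeR::calcNormFactors(simDataIslam$counts)
weightsCustom <- zeroWeightsLS(counts=simDataIslam$counts, design=design, maxit=200, normFactors=nf)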

The weights can then be stored in the DGEList object, where they will be used for dispersion estimation by the native edgeR estimateDisp function.

After estimating the dispersions and the zingeR posterior probabilities, the glmWeightedF function is used for statistical inference. It is an adaptation of the glmLRT function of edgeR that uses an F-test for which the denominator degrees of freedom are by default adjusted according to the downweighting of excess zeros (argument ZI=TRUE). Additionally, independent filtering can be performed on the obtained p-values (argument independentFiltering=TRUE); we adopt the independent filtering strategy that was originally implemented in DESeq2, with the average fitted values as the default filter criterion.

# zingeR-edgeR analysis
library(edgeR)
d=DGEList(simDataIslam$counts)
d=suppressWarnings(calcNormFactors(d)) # TMM normalization
design=model.matrix(~grp)
# estimate the zingeR posterior probabilities (observation-level weights)
weights <- zeroWeightsLS(counts=d$counts, design=design, maxit=200, normalization="TMM")
d$weights <- weights
d=estimateDisp(d, design) # weighted dispersion estimation
plotBCV(d)
fit=glmFit(d, design)
# weighted F-test with zero-inflation-adjusted denominator df and independent filtering
lrt=glmWeightedF(fit, coef=2, independentFiltering=TRUE)
hist(lrt$table$PValue)
sum(lrt$table$padjFilter<=0.05, na.rm=TRUE)
mean(simDataIslam$indDE %in% which(lrt$table$padjFilter<=0.05)) #TPR
mean(which(lrt$table$padjFilter<=0.05) %in% simDataIslam$indNonDE) #FDP
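
The degrees-of-freedom adjustment mentioned above can be inspected directly: downweighting excess zeros reduces the effective residual degrees of freedom of gene $g$ from $n - p$ to $(\sum_{i=1}^n w_{gi}) - p$. A quick check using the objects created above:

# effective residual df per gene after downweighting excess zeros
effDf <- rowSums(d$weights) - ncol(design)
summary(effDf)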

zingeR-DESeq2

For the zingeR-DESeq2 analysis we re-estimate the weights using the positive-counts normalization recently implemented in DESeq2 and first developed as part of the phyloseq Bioconductor package. The weights should then be provided to the DESeqDataSet object using the assays accessor function, after which a conventional DESeq2 analysis follows. Note that we allow for default shrinkage of the fold changes by specifying betaPrior=TRUE. We adjust for the downweighting of excess zeros by using the t-distribution as the null distribution (argument useT=TRUE) with degrees of freedom for gene $g$, $df_g = (\sum_{i=1}^n w_{gi}) - p$, with $w_{gi}$ the weight for gene $g$ in sample $i$ and $p$ the number of coefficients, i.e. the number of columns of the design matrix. The adjusted degrees of freedom are specified with the df argument of DESeq2's nbinomWaldTest function.

# zingeR-DESeq2 analysis
library(DESeq2)
colData=data.frame(grp=grp)
design=model.matrix(~grp)
dse=DESeqDataSetFromMatrix(countData=simDataIslam$counts, colData=colData, design=~grp)
# re-estimate the weights under the DESeq2 positive-counts normalization
weights <- zeroWeightsLS(counts=simDataIslam$counts, design=design, maxit=200, normalization="DESeq2_poscounts", colData=colData, designFormula=~grp)
assays(dse)[["weights"]]=weights
dse = DESeq2::estimateSizeFactors(dse, type="poscounts")
dse = estimateDispersions(dse)
# t-distribution null with df adjusted for downweighted zeros; p=2 coefficients here
dse = nbinomWaldTest(dse, betaPrior=TRUE, useT=TRUE, df=rowSums(weights)-2)
res = results(dse)
sum(res$padj<=0.05, na.rm=TRUE)
mean(simDataIslam$indDE %in% which(res$padj<=0.05)) #TPR
mean(which(res$padj<=0.05) %in% simDataIslam$indNonDE) #FDP

Aside from using edgeR or DESeq2 for the negative binomial count component of the zero-inflated negative binomial model, the user can plug the weights estimated with zeroWeightsLS into any count-based method of choice.
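
As a purely hypothetical illustration, the weights could enter a gene-wise weighted negative binomial regression, here via MASS::glm.nb; glm.nb and the log-library-size offset are our own choices for this sketch and are not part of zingeR:

# sketch: weighted NB regression for a single gene using the zingeR weights
library(MASS)
offs <- log(colSums(simDataIslam$counts)) # log library sizes as offset
g <- 1 # first gene, for illustration
fitg <- glm.nb(simDataIslam$counts[g,] ~ grp + offset(offs), weights=weights[g,])
summary(fitg)$coefficients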


