# knitr options library(knitr) knit_hooks$set(backspace = pkgmaker::hook_backspace()) opts_chunk$set(dpi=300, dev="png")#, dev.args=list(pointsize=15)) IS_PKGDOWN <- nzchar(Sys.getenv('_R_PKGDOWN_RENDER')) # packages library(BiocStyle) # pkgmaker::load_project('pancreas/bseq-sc') library(bseqsc) library(ggplot2) library(reshape2) library(plyr) # load color theme COLOR_THEME <- bseqsc:::.ggtheme() BSEQSC <- 'BSEQ-sc' BSEQSCpkg <- 'bseqsc'
if( local_data <- dir.exists(data_dir <- file.path(pkgmaker::find_devpackage('pancreas'), 'docs/data')) ){ eset <- readRDS(file.path(data_dir, 'GSE50244.rds')) eislet <- readRDS(file.path(data_dir, 'islet-eset.rds')) }
cat(" ## How to cite `r BSEQSC` {-} When using this work in any way, please cite the following work: > *A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure*<br /> > <small>M. Baron, A. Veres, S.L. Wolock, A. L. Faust, R. Gaujoux, A. Vetere, J. Hyoje Ryu, B. K. Wagner, S. Shen-Orr, A. M. Klein, D. A. Melton, I. Yanai<br /></small> > Cell Systems. 2016 Oct 26 [10.1016/j.cels.2016.08.011](https://www.ncbi.nlm.nih.gov/pubmed/27667365) ")
In order to run the entire deconvolution pipeline, users need to install the r Githubpkg('shenorrlab/bseq-sc', 'bseqsc')
R package from GitHub.
See the installtion instructions for details.
r BSEQSC
uses two types of input data:
r BSEQSC
analysis pipelineSingle cell RNA-seq information is used to deconvolve bulk RNA-seq samples by estimating the cell-type proportion of key cell-types. Statistical deconvolution is then used to leverage the variation in cell type frequencies between individuals to estimate average cell-type specific expression in 2 groups of individuals and to compute cell type-specific differential expression.
We used this approach to investigate gene expression differences between diabetics and healthy individuals, using single cell RNA-seq data obtained from an independent set of pancreatic islets (Figure 1). However, it is applicable to other type of tissue, for which both bulk and single cell gene expression data are available.
cat("![](images/pipeline-800x600.png)")
We reproduce here the analysis detailed in [@Baron2016].
Bioconductor base package provides the ExpressionSet
class, which is a convenient data structure to hold expression data along with sample/feature annotation.
Here we use two ExpressionSet
objects to handle the bulk and single cell data respectively.
The dataset's GEO entry (GSE50244) contains raw RNA-seq and sample annotation data. For the purpose of this vignette, we will use data that we pre-processed and made available on the data download page.
After pre-processing, the data were filtered to remove unmapped tags, or tags that mapped to pseudogenes, or known ribosomal/mithocondrial genes (list available in file data/genes_mitrib.txt.gz
or via function getMITRIB()
provided in the package).
Moreover, counts from transcripts that mapped to a same gene were averaged.
Sample phenotypic annotations were also processed to define glycemic groups based on hba1c measurements, as per the original publication [@Fadista2014].
# download GEO dataset from Github eset <- readRDS('https://shenorrlab.github.io/bseqsc/data/GSE50244.rds')
# for this analysis we only look at samples with hba1c data eset <- droplevels(eset[, !is.na(eset$hba1c)]) eset
We generated single cell RNA-seq data of pancreatic islets from 3 healthy individuals, which provided gene expression profiles for 17434 genes in 7729 cells.
The raw counts for these data are available on the data download
page, in the form of an
ExpressionSet
:
eislet <- readRDS('https://shenorrlab.github.io/bseqsc/data/islet-eset.rds') eislet
After identification of the cell types by clustering, we extract marker genes for each cell type, i.e. sets of genes that are mostly highly expressed in each cell type, which can be done in multiple ways, see for example details in [@Baron2016].
The r BSEQSCpkg
package provides the list of selected marker genes for the pancreatic islets, stored in data object pancreasMarkers
:
data(pancreasMarkers) str(pancreasMarkers)
Once cell type-specific markers have been identified, we build a reference basis matrix from their single cell gene expression, which we use to infer the composition of bulk tissue samples from their total gene expression.
The purpose of this matrix is to provide reference expression profiles that together discriminate the cell types of interest. We compute it by averaging, within each cell type, the gene expression of each marker gene across all the cells in the cell type.
Notably, the computation is performed on scaled single cell count data, to better reflect cell type-specific variations in average transcript counts. Indeed, looking at the average counts within each cell type shows how cell types have varying levels of RNA transcripts:
# average counts computed within each cell type in each sample plotCellTotals(eislet, 'cellType', 'sampleID')
Because we want the basis matrix of reference profiles to reflect these biological differences, we use these average counts to re-scale the CPM data before computing the average gene expression.
B <- bseqsc_basis(eislet, pancreasMarkers, clusters = 'cellType', samples = 'sampleID', ct.scale = TRUE)
The resulting basis matrix should show a clear cell type-specific expression pattern (Figure 2):
plotBasis(B, pancreasMarkers, Colv = NA, Rowv = NA, layout = '_', col = 'Blues')
Cell type proportions are estimated using CIBERSORT, a method that was developed to investigate immune infiltrating landscape in tumour tissue [@Newman2015; @Gentles2015] (See section Requirements for setup instructions).
fit <- bseqsc_proportions(eset, B, verbose = TRUE)
To facilitate plotting and model fitting in downstream analysis, we add the estimated proportions as extra phenotypic variables to the ExpressionSet
:
pData(eset) <- cbind(pData(eset), t(coef(fit)))
Estimated proportions can then be compared between the two HBA1C groups, within each cell type (Figure 3).
local({ df <- melt(pData(eset), measure.vars = rownames(coef(fit))) ggplot(df, aes(x = hba1c_class2, y = value, color = hba1c_class2)) + geom_boxplot(width = .3) + facet_wrap( ~ variable, nrow = 1, scales = 'free') + scale_x_discrete(breaks = NULL) + scale_color_brewer(type = 'qual', palette = 'Set1', direction = -1) + xlab('') + ylab('Frequency') + guides(color = guide_legend('')) })
Now that we have estimates for the main cell types, we can use r Biocpkg('EdgeR')
to derive a set of genes that are likely to be differentially regulated at the cell type level.
To do so, we fit 2 models of gene expression differences between HBA1C groups (IGT, Diabetic, Normal): * a base model only includes the HBA1C group variable, as well as gender and age covariates; * an extended model that additionally includes the proportions of alpha, beta, ductal and acinar cells.
Genes that are likely to be differentially regulated at the cell type level are those whose p-values improve (i.e. decrease) when adjusting for cell type proportions (red dots on Figure 4). Here we used a p-value threshold of 0.01.
To fit these models we use the utility function fitEdgeR
included in the package:
fitEdgeR
# base fit_edger <- fitEdgeR(eset, ~ gender + ageN + hba1c_class, coef = c('hba1c_classIGT', 'hba1c_classDiabetic')) # extended fit_edger_ext <- fitEdgeR(eset, ~ gender + ageN + alpha + beta + ductal + acinar + hba1c_class , coef = c('hba1c_classIGT', 'hba1c_classDiabetic'))
We then visualize
# gather P-values from both models edger_pvals <- cbind(fit_edger$table[, 'Symbol', drop = FALSE], Base = fit_edger$table$PValue , Adjusted = fit_edger_ext$table$PValue[match(fit_edger$table$Symbol, fit_edger_ext$table$Symbol)] , stringsAsFactors = FALSE) edger_pvals <- mutate(edger_pvals, Regulated = Adjusted <= 0.01 & Adjusted <= Base) # plot Base vs Adjusted pvalueScatter(Base ~ Adjusted, edger_pvals, pval.th = 0.01, label.th = 3.5)
Now, we run a cell type-specific differential analysis using csSAM
[@Shen-Orr2010], on the subset of selected genes, to which we add a set of 6 ER stress-related genes, that were identified in the single cell data analysis (see details in [@Baron2016]).
csSAM
fits a linear model that estimates interaction terms between cell proportions and a group variable.
This imposes a constraint on the sample size within each group.
For this reason we only inlude into the model the 4 cell types (alpha, beta, ductal and acinar cells), either due to their variation (Figure 3) or biological relevance.
# ER genes genes_ER <- c('HSPA5', 'MAFA', 'HERPUD1', 'DDIT3', 'UCN3', 'NEUROD1') # Fit on ER stress genes and genes regulated beyond cell type proportion differences genes <- union(genes_ER, subset(edger_pvals, Regulated)$Symbol) # NB: we fix seed for reproducibility csfit <- bseqsc_csdiff(eset[genes, ] ~ gender + ageN + hba1c_class2 | alpha + beta + ductal + acinar, verbose = 2, nperms = 5000, .rng = 12345)
We observe a strong signal for up-regulation on beta cells (at FDR 0.05), and milder signals of up and down-regulation on acinar and beta cells respectively (at looser FDR 0.25).
p <- plot(csfit, alt = c('great', 'less'), by = 'alt', xlab = 'Number of significant genes', ylab = 'FDR cutoff') p$data$Cell.type <- gsub(":Hyper", "", p$data$Cell.type) p$data$Alternative <- c('greater' = 'Up-regulated', less = 'Down-regulated')[as.character(p$data$Alternative)] p + theme_get()
We filter these results for genes with the strongest signal in alpha, beta and acinar cells:
cst <- csTopTable(csfit, threshold = c(alpha = 0.05, beta = 0.25, acinar = 0.25), alt = 'min', merge = TRUE, n = Inf, nodups = TRUE) plotEffectSize(cst)
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.