knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE) library(BiocStyle) set.seed(10918)
r Biocpkg("scuttle")
provides various low-level utilities for single-cell RNA-seq data analysis,
typically used at the start of the analysis workflows or within high-level functions in other packages.
This vignette will discuss the use of scaling normalization for removing cell-specific biases.
To demonstrate, we will obtain the classic Zeisel dataset from the r Biocpkg("scRNAseq")
package
and apply some quick quality control to remove damaged cells.
library(scRNAseq) sce <- ZeiselBrainData() library(scuttle) sce <- quickPerCellQC(sce, subsets=list(Mito=grep("mt-", rownames(sce))), sub.fields=c("subsets_Mito_percent", "altexps_ERCC_percent")) sce
# Make the damn thing sparse for speed. counts(sce) <- as(counts(sce), "dgCMatrix")
Note: A more comprehensive description of the use of r Biocpkg("scuttle")
functions
(along with other packages) in a scRNA-seq analysis workflow is available at https://osca.bioconductor.org.
Scaling normalization involves dividing the counts for each cell by a cell-specific "size factor" to adjust for uninteresting differences in sequencing depth and capture efficiency.
The librarySizeFactors()
function provides a simple definition of the size factor for each cell,
computed as the library size of each cell after scaling them to have a mean of 1 across all cells.
This is fast but inaccurate in the presence of differential expression between cells that introduce composition biases.
summary(librarySizeFactors(sce))
The geometricSizeFactors()
function instead computes the geometric mean within each cell.
This is more robust to composition biases but is only accurate when the counts are large and there are few zeroes.
summary(geometricSizeFactors(sce))
The medianSizeFactors()
function uses a r Biocpkg("DESeq2")
-esque approach based on the median ratio from an average pseudo-cell.
Briefly, we assume that most genes are non-DE, such that any systematic fold difference in coverage (as defined by the median ratio) represents technical biases that must be removed.
This is highly robust to composition biases but relies on sufficient sequencing coverage to obtain well-defined ratios.
summary(medianSizeFactors(sce))
All of these size factors can be stored in the SingleCellExperiment
via the sizeFactors<-()
setter function.
Most downstream functions will pick these up automatically for any calculations that rely on size factors.
sizeFactors(sce) <- librarySizeFactors(sce)
Alternatively, functions like computeLibraryFactors()
can automatically compute and attach the size factors to our SingleCellExperiment
object.
sce <- computeLibraryFactors(sce) summary(sizeFactors(sce))
The computePooledFactors
method implements the pooling strategy for scaling normalization.
This uses an approach similar to medianSizeFactors()
to remove composition biases,
but pools cells together to overcome problems with discreteness at low counts.
Per-cell factors are then obtained from the pools using a deconvolution strategy.
library(scran) clusters <- quickCluster(sce) sce <- computePooledFactors(sce, clusters=clusters) summary(sizeFactors(sce))
For larger data sets, a rough clustering should be performed prior to normalization.
computePooledFactors()
will then automatically apply normalization within each cluster first, before adjusting the scaling factors to be comparable across clusters.
This reduces the risk of violating our assumptions (of a non-DE majority of genes) when many genes are DE between clusters in a heterogeneous population.
In this case, we use the quickCluster()
function from the r Biocpkg("scran")
package, but any clustering function can be used, for example:
sce <- computePooledFactors(sce, clusters=sce$level1class) summary(sizeFactors(sce))
We assume that quality control on the cells has already been performed prior to running this function. Low-quality cells with few expressed genes can often have negative size factor estimates.
An alternative approach is to normalize based on the spike-in counts. The idea is that the same quantity of spike-in RNA was added to each cell prior to library preparation. Size factors are computed to scale the counts such that the total coverage of the spike-in transcripts is equal across cells.
sce2 <- computeSpikeFactors(sce, "ERCC") summary(sizeFactors(sce2))
The main practical difference from the other strategies is that spike-in normalization preserves differences in total RNA content between cells, whereas computePooledFactors
and other non-DE methods do not.
This can be important in certain applications where changes in total RNA content are associated with a biological phenomenon of interest.
Regardless of which size factor calculation we pick, the calculation of normalized expression values simply involves dividing each count by the size factor for the cell.
This eliminates the cell-specific scaling effect for valid comparisons between cells in downstream analyses.
The simplest approach to computing these values is to use the logNormCounts()
function:
sce <- logNormCounts(sce) assayNames(sce)
This computes log~2~-transformed normalized expression values by adding a constant pseudo-count and log-transforming.
The resulting values can be roughly interpreted on the same scale as log-transformed counts and are stored in "logcounts"
.
This is the most common expression measure for downstream analyses as differences between values can be treated as log-fold changes.
For example, Euclidean distances between cells are analogous to the average log-fold change across genes.
Of course, users can construct any arbitrary matrix of the same dimensions as the count matrix and store it as an assay.
Here, we use the normalizeCounts()
function to perform some custom normalization with random size factors.
assay(sce, "normed") <- normalizeCounts(sce, log=FALSE, size.factors=runif(ncol(sce)), pseudo.count=1.5)
r Biocpkg("scuttle")
can also calculate counts-per-million using the aptly-named calculateCPM()
function.
The output is most appropriately stored as an assay named "cpm"
in the assays of the SingleCellExperiment
object.
Related functions include calculateTPM()
and calculateFPKM()
, which do pretty much as advertised.
assay(sce, "cpm") <- calculateCPM(sce)
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.