multiBatchNorm: Per-batch scaling normalization

View source: R/multiBatchNorm.R

multiBatchNorm R Documentation

Per-batch scaling normalization

Description

Perform scaling normalization within each batch to provide comparable results to the lowest-coverage batch.

Usage

multiBatchNorm(
  ...,
  batch = NULL,
  norm.args = list(),
  min.mean = 1,
  subset.row = NULL,
  normalize.all = FALSE,
  preserve.single = TRUE,
  assay.type = "counts",
  BPPARAM = SerialParam()
)

Arguments

...

One or more SingleCellExperiment objects containing counts and size factors. Each object should contain the same number of rows, corresponding to the same genes in the same order.

If multiple objects are supplied, each object is assumed to contain all and only cells from a single batch. If a single object is supplied, batch should also be specified.

Alternatively, one or more lists of SingleCellExperiments can be provided; this is flattened as if the objects inside were passed directly to ....

batch

A factor specifying the batch of origin for all cells when only a single object is supplied in .... This is ignored if multiple objects are present.

norm.args

A named list of further arguments to pass to logNormCounts.

min.mean

A numeric scalar specifying the minimum (library size-adjusted) average count of genes to be used for normalization.

subset.row

A vector specifying which features to use for normalization.

normalize.all

A logical scalar indicating whether normalized values should be returned for all genes.

preserve.single

A logical scalar indicating whether to combine the results into a single matrix if only one object was supplied in ....

assay.type

A string specifying which assay contains the count matrix.

BPPARAM

A BiocParallelParam object specifying whether calculations should be parallelized.

Details

When performing integrative analyses of multiple batches, it is often the case that different batches have large differences in sequencing depth. This function removes systematic differences in coverage across batches to simplify downstream comparisons. It does so by rescaling the size factors using median-based normalization on the ratio of the average counts between batches. This is roughly equivalent to the between-cluster normalization described by Lun et al. (2016).

This function will adjust the size factors so that counts in high-coverage batches are scaled downwards to match the coverage of the most shallow batch. The logNormCounts function will then add the same pseudo-count to all batches before log-transformation. By scaling downwards, we favour stronger squeezing of log-fold changes from the pseudo-count, mitigating any technical differences in variance between batches. Of course, genuine biological differences will also be shrunk, but this is less of an issue for upregulated genes with large counts.
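The direction of the adjustment can be illustrated with a minimal sketch; the two simulated batches and their Poisson means below are invented for demonstration, following the style of the Examples section:

```r
library(batchelor)  # also attaches SingleCellExperiment
set.seed(1000)

# A shallow batch and a roughly 10-fold deeper batch over the same 200 genes.
shallow <- SingleCellExperiment(list(counts=matrix(rpois(10000, lambda=5), ncol=50)))
deep <- SingleCellExperiment(list(counts=matrix(rpois(10000, lambda=50), ncol=50)))
sizeFactors(shallow) <- runif(ncol(shallow))
sizeFactors(deep) <- runif(ncol(deep))

out <- multiBatchNorm(shallow, deep)

# The deeper batch receives larger rescaled size factors, so dividing its
# counts by them scales that batch down to the coverage of the shallow batch.
mean(sizeFactors(out[[2]])) > mean(sizeFactors(out[[1]]))
```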

Only genes with library size-adjusted average counts greater than min.mean will be used for computing the rescaling factors. This improves precision and avoids problems with discreteness. By default, we use min.mean=1, which is usually satisfactory but may need to be lowered for very sparse datasets.
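For very sparse data, a hedged sketch of lowering the threshold; the lambda values here are invented so that most per-gene averages fall below the default min.mean=1:

```r
library(batchelor)  # also attaches SingleCellExperiment
set.seed(2000)

# Very sparse batches: average counts per gene are well below 1.
sparse1 <- SingleCellExperiment(list(counts=matrix(rpois(50000, lambda=0.2), ncol=100)))
sparse2 <- SingleCellExperiment(list(counts=matrix(rpois(50000, lambda=0.4), ncol=100)))
sizeFactors(sparse1) <- runif(ncol(sparse1))
sizeFactors(sparse2) <- runif(ncol(sparse2))

# Lowering min.mean retains enough genes to compute stable rescaling factors.
out <- multiBatchNorm(sparse1, sparse2, min.mean=0.1)
```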

Users can also set subset.row to restrict the set of genes used for computing the rescaling factors. By default, normalized values will only be returned for genes specified in the subset. Setting normalize.all=TRUE will return normalized values for all genes.
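A sketch of restricting the rescaling computation to a gene subset while still returning normalized values for every gene; the choice of the first 100 genes is arbitrary:

```r
library(batchelor)  # also attaches SingleCellExperiment
set.seed(3000)

d1 <- matrix(rnbinom(50000, mu=10, size=1), ncol=100)
sce1 <- SingleCellExperiment(list(counts=d1))
sizeFactors(sce1) <- runif(ncol(d1))

d2 <- matrix(rnbinom(20000, mu=50, size=1), ncol=40)
sce2 <- SingleCellExperiment(list(counts=d2))
sizeFactors(sce2) <- runif(ncol(d2))

# Rescaling factors are computed from the first 100 genes only, but
# normalize.all=TRUE still yields logcounts for all 500 genes.
out <- multiBatchNorm(sce1, sce2, subset.row=1:100, normalize.all=TRUE)
nrow(out[[1]])
```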

Value

A list of SingleCellExperiment objects with normalized log-expression values in the "logcounts" assay (depending on values in norm.args). Each object contains cells from a single batch.

If preserve.single=TRUE and ... contains only one SingleCellExperiment, that object is returned with an additional "logcounts" assay containing normalized log-expression values. The order of cells is not changed.
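The single-object case can be sketched as follows; the batch labels are invented:

```r
library(batchelor)  # also attaches SingleCellExperiment
set.seed(4000)

combined <- SingleCellExperiment(list(counts=matrix(rpois(30000, lambda=10), ncol=60)))
sizeFactors(combined) <- runif(ncol(combined))
block <- rep(c("A", "B"), each=30)

# With preserve.single=TRUE (the default), a single object comes back
# with a new "logcounts" assay; the order of cells is unchanged.
normed <- multiBatchNorm(combined, batch=block)
assayNames(normed)
```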

Comparison to other normalization strategies

For comparison, imagine if we ran logNormCounts separately in each batch prior to correction. Size factors will be computed within each batch, and batch-specific application in logNormCounts will not account for scaling differences between batches. In contrast, multiBatchNorm will rescale the size factors so that they are comparable across batches. This removes at least one difference between batches to facilitate easier correction.
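This contrast can be sketched directly; logNormCounts comes from the scuttle package, and the simulated depth difference is invented:

```r
library(batchelor)  # also attaches SingleCellExperiment
library(scuttle)    # for logNormCounts
set.seed(5000)

b1 <- SingleCellExperiment(list(counts=matrix(rpois(20000, lambda=5), ncol=50)))
b2 <- SingleCellExperiment(list(counts=matrix(rpois(20000, lambda=50), ncol=50)))

# Naive approach: normalize each batch separately. Size factors are only
# comparable within a batch, so the depth difference persists in logcounts.
naive1 <- logNormCounts(b1)
naive2 <- logNormCounts(b2)

# multiBatchNorm rescales the size factors to be comparable across batches,
# bringing the average log-expression of the two batches closer together.
out <- multiBatchNorm(b1, b2)
```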

cosineNorm performs a similar role of equalizing the scale of expression values across batches. However, the advantage of multiBatchNorm is that its output is more easily interpreted - the normalized values remain on the log-scale and differences can still be interpreted (roughly) as log-fold changes. The output can then be fed into downstream analysis procedures (e.g., HVG detection) in the same manner as typical log-normalized values from logNormCounts.

Author(s)

Aaron Lun

References

Lun ATL (2018). Further MNN algorithm development. https://MarioniLab.github.io/FurtherMNN2018/theory/description.html

See Also

mnnCorrect and fastMNN, for methods that can benefit from rescaling.

logNormCounts for the calculation of log-transformed normalized expression values.

applyMultiSCE, to apply this function over the altExps in x.

Examples

library(batchelor)  # also attaches SingleCellExperiment

# Mock up two batches with very different sequencing depths.
d1 <- matrix(rnbinom(50000, mu=10, size=1), ncol=100)
sce1 <- SingleCellExperiment(list(counts=d1))
sizeFactors(sce1) <- runif(ncol(d1))

d2 <- matrix(rnbinom(20000, mu=50, size=1), ncol=40)
sce2 <- SingleCellExperiment(list(counts=d2))
sizeFactors(sce2) <- runif(ncol(d2))

# The rescaled size factors are stored in the output objects.
out <- multiBatchNorm(sce1, sce2)
summary(sizeFactors(out[[1]]))
summary(sizeFactors(out[[2]]))


LTLA/batchelor documentation built on July 10, 2024, 9:09 p.m.