View source: R/multiBatchNorm.R
multiBatchNorm | R Documentation |
Perform scaling normalization within each batch to provide comparable results to the lowest-coverage batch.
multiBatchNorm(
...,
batch = NULL,
norm.args = list(),
min.mean = 1,
subset.row = NULL,
normalize.all = FALSE,
preserve.single = TRUE,
assay.type = "counts",
BPPARAM = SerialParam()
)
... |
One or more SingleCellExperiment objects containing counts and size factors. Each object should contain the same number of rows, corresponding to the same genes in the same order. If multiple objects are supplied, each object is assumed to contain all and only cells from a single batch.
If a single object is supplied, batch should also be specified.
Alternatively, one or more lists of SingleCellExperiments can be provided; these are flattened as if the objects inside were passed directly via the ... argument. |
batch |
A factor specifying the batch of origin for each cell when only a single object is supplied via the ... argument. |
norm.args |
A named list of further arguments to pass to logNormCounts. |
min.mean |
A numeric scalar specifying the minimum (library size-adjusted) average count of genes to be used for normalization. |
subset.row |
A vector specifying which features to use for normalization. |
normalize.all |
A logical scalar indicating whether normalized values should be returned for all genes. |
preserve.single |
A logical scalar indicating whether to combine the results into a single matrix if only one object was supplied via the ... argument. |
assay.type |
A string specifying which assay contains the counts. |
BPPARAM |
A BiocParallelParam object specifying whether calculations should be parallelized. |
When performing integrative analyses of multiple batches, it is often the case that different batches have large differences in sequencing depth. This function removes systematic differences in coverage across batches to simplify downstream comparisons. It does so by rescaling the size factors using median-based normalization on the ratio of the average counts between batches. This is roughly equivalent to the between-cluster normalization described by Lun et al. (2016).
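The median-based rescaling can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the package's actual implementation: the helper batch_average, the simulated counts and the abundance filter are all hypothetical, and size factors are simply derived from library sizes.

```r
set.seed(100)
n.genes <- 500

# Two simulated batches with the same genes; batch 2 is about 5x deeper.
counts1 <- matrix(rnbinom(n.genes * 100, mu = 10, size = 1), nrow = n.genes)
counts2 <- matrix(rnbinom(n.genes * 40, mu = 50, size = 1), nrow = n.genes)

# Hypothetical helper: library size-derived size factors centred at 1
# within the batch, then the per-gene average of the adjusted counts.
batch_average <- function(counts) {
    sf <- colSums(counts)
    sf <- sf / mean(sf)
    rowMeans(sweep(counts, 2, sf, "/"))
}

ave1 <- batch_average(counts1)
ave2 <- batch_average(counts2)

# Median ratio of average counts relative to batch 1,
# using only genes with non-trivial abundance in both batches.
keep <- ave1 > 1 & ave2 > 1
coverage <- c(1, median(ave2[keep] / ave1[keep]))

# Express each batch's coverage relative to the shallowest batch.
rescale <- coverage / min(coverage)
rescale
```

In this sketch, multiplying the size factors of each batch by its entry in rescale would leave the shallowest batch untouched and scale the normalized counts of the deeper batch downwards.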
This function will adjust the size factors so that counts in high-coverage batches are scaled downwards to match the coverage of the most shallow batch.
The logNormCounts function will then add the same pseudo-count to all batches before log-transformation.
By scaling downwards, we favour stronger squeezing of log-fold changes from the pseudo-count, mitigating any technical differences in variance between batches.
Of course, genuine biological differences will also be shrunk, but this is less of an issue for upregulated genes with large counts.
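The squeezing effect of the pseudo-count can be seen with a small numeric sketch. The counts, the pseudo-count of 1 and the helper lfc are assumptions chosen purely for illustration.

```r
# Hypothetical counts for one gene in two cells of a deep batch;
# the true difference is a 2-fold change (log2-fold change of 1).
counts <- c(up = 20, base = 10)
pseudo <- 1

# Log2-fold change between the two cells for a given size factor.
lfc <- function(size.factor) {
    logvals <- log2(counts / size.factor + pseudo)
    unname(logvals["up"] - logvals["base"])
}

lfc.original <- lfc(1)  # size factors left as they are
lfc.rescaled <- lfc(5)  # size factors increased 5-fold, so counts shrink
c(original = lfc.original, rescaled = lfc.rescaled)
```

Both values fall below the true log2-fold change of 1, and the value obtained after scaling the counts downwards is the smaller of the two, as the pseudo-count makes up a larger share of the shrunken counts.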
Only genes with library size-adjusted average counts greater than min.mean will be used for computing the rescaling factors.
This improves precision and avoids problems with discreteness.
By default, we use min.mean=1, which is usually satisfactory but may need to be lowered for very sparse datasets.
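The abundance filter can be approximated in base R as shown below. This is a rough sketch that assumes library size-derived adjustment factors and simulated data; the function's internal calculation of the average may differ in its details.

```r
set.seed(42)

# Simulated counts: 200 genes by 50 cells covering a range of abundances.
gene.means <- 2^runif(200, -3, 5)
counts <- matrix(rnbinom(200 * 50, mu = gene.means, size = 1), nrow = 200)

# Library size-adjusted average count per gene: divide each cell by its
# relative library size, then average across cells.
lib.sf <- colSums(counts) / mean(colSums(counts))
ave <- rowMeans(sweep(counts, 2, lib.sf, "/"))

# Keep only genes above the threshold for estimating rescaling factors.
min.mean <- 1
keep <- ave > min.mean
table(keep)
```

Lowering min.mean in this sketch retains more low-abundance genes, at the cost of noisier and more discrete averages.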
Users can also set subset.row to restrict the set of genes used for computing the rescaling factors.
By default, normalized values will only be returned for genes specified in the subset.
Setting normalize.all=TRUE will return normalized values for all genes.
A list of SingleCellExperiment objects with normalized log-expression values in the "logcounts" assay (depending on values in norm.args).
Each object contains cells from a single batch.
If preserve.single=TRUE and ... contains only one SingleCellExperiment, that object is returned with an additional "logcounts" assay containing normalized log-expression values.
The order of cells is not changed.
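The single-object case might look like the following sketch. The simulated counts, the cell names, the library size-derived size factors and the batch labels are assumptions made for illustration; only multiBatchNorm and its batch argument come from this page.

```r
library(batchelor)  # also attaches SingleCellExperiment

set.seed(10)

# One object holding two batches of differing depth (60 and 40 cells).
counts <- cbind(
    matrix(rnbinom(500 * 60, mu = 10, size = 1), nrow = 500),
    matrix(rnbinom(500 * 40, mu = 50, size = 1), nrow = 500)
)
colnames(counts) <- paste0("cell", seq_len(ncol(counts)))

sce <- SingleCellExperiment(list(counts = counts))
sizeFactors(sce) <- colSums(counts) / mean(colSums(counts))

# Batch of origin for each cell, in the same order as the columns.
batch <- rep(c("A", "B"), c(60, 40))

# With the default preserve.single=TRUE, a single object is returned.
out <- multiBatchNorm(sce, batch = batch)
assayNames(out)
identical(colnames(out), colnames(sce))
```

Passing the two batches as separate objects instead would return a list with one object per batch.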
For comparison, imagine if we ran logNormCounts separately in each batch prior to correction.
Size factors will be computed within each batch, and batch-specific application in logNormCounts will not account for scaling differences between batches.
In contrast, multiBatchNorm will rescale the size factors so that they are comparable across batches.
This removes at least one difference between batches to facilitate easier correction.
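The contrast between the two strategies can be illustrated with a base R sketch. The helpers centred_sf and lognorm, the Poisson simulation and the way the between-batch ratio is estimated are all assumptions standing in for the real functions.

```r
set.seed(1)

# The same population sequenced at two depths; batch 2 is 4x deeper.
b1 <- matrix(rpois(200 * 30, lambda = 5), nrow = 200)
b2 <- matrix(rpois(200 * 30, lambda = 20), nrow = 200)

# Hypothetical helpers: size factors centred within a batch, and a
# log2-transformation of normalized counts with a pseudo-count of 1.
centred_sf <- function(counts) {
    sf <- colSums(counts)
    sf / mean(sf)
}
lognorm <- function(counts, sf) log2(sweep(counts, 2, sf, "/") + 1)

# Separate normalization: size factors are centred within each batch,
# so the depth difference between batches remains in the log-values.
sep1 <- lognorm(b1, centred_sf(b1))
sep2 <- lognorm(b2, centred_sf(b2))
gap.separate <- mean(sep2) - mean(sep1)

# Rescaled normalization: inflate the size factors of batch 2 by the
# median between-batch ratio of average counts.
scale2 <- median(rowMeans(b2) / rowMeans(b1))
res2 <- lognorm(b2, centred_sf(b2) * scale2)
gap.rescaled <- mean(res2) - mean(sep1)

c(separate = gap.separate, rescaled = gap.rescaled)
```

The gap in average log-expression between batches should be much smaller after rescaling, which is the difference that would otherwise be left for the batch correction step to absorb.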
cosineNorm performs a similar role of equalizing the scale of expression values across batches.
However, the advantage of multiBatchNorm is that its output is more easily interpreted: the normalized values remain on the log-scale and differences can still be interpreted (roughly) as log-fold changes.
The output can then be fed into downstream analysis procedures (e.g., HVG detection) in the same manner as typical log-normalized values from logNormCounts.
Aaron Lun
Lun ATL (2018). Further MNN algorithm development. https://MarioniLab.github.io/FurtherMNN2018/theory/description.html
mnnCorrect and fastMNN, for methods that can benefit from rescaling.
logNormCounts for the calculation of log-transformed normalized expression values.
applyMultiSCE, to apply this function over the altExps in x.
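library(batchelor) # also attaches SingleCellExperiment

# Batch 1: 500 genes by 100 cells at low coverage, with arbitrary size factors.
d1 <- matrix(rnbinom(50000, mu=10, size=1), ncol=100)
sce1 <- SingleCellExperiment(list(counts=d1))
sizeFactors(sce1) <- runif(ncol(d1))

# Batch 2: 500 genes by 40 cells at higher coverage.
d2 <- matrix(rnbinom(20000, mu=50, size=1), ncol=40)
sce2 <- SingleCellExperiment(list(counts=d2))
sizeFactors(sce2) <- runif(ncol(d2))

# Rescale the size factors across batches and compare their distributions.
out <- multiBatchNorm(sce1, sce2)
summary(sizeFactors(out[[1]]))
summary(sizeFactors(out[[2]]))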