View source: R/normalization.R
getSpikeInNFs | R Documentation |
Use getSpikeInNFs to obtain the spike-in normalization factors, or spikeInNormGRanges to return the input GRanges objects with their readcounts spike-in normalized.
getSpikeInNFs(
dataset.gr,
si_pattern = NULL,
si_names = NULL,
method = c("SRPMC", "SNR", "RPM"),
batch_norm = TRUE,
ctrl_pattern = NULL,
ctrl_names = NULL,
field = "score",
sample_names = NULL,
expand_ranges = FALSE,
ncores = getOption("mc.cores", 2L)
)
spikeInNormGRanges(
dataset.gr,
si_pattern = NULL,
si_names = NULL,
method = c("SRPMC", "SNR", "RPM"),
batch_norm = TRUE,
ctrl_pattern = NULL,
ctrl_names = NULL,
field = "score",
sample_names = NULL,
expand_ranges = FALSE,
ncores = getOption("mc.cores", 2L)
)
dataset.gr |
A GRanges object, or (more typically) a list of GRanges objects. |
si_pattern |
A regular expression that matches spike-in chromosomes. Can
be used in addition to, or as an alternative to |
si_names |
A character vector giving the names of the spike-in
chromosomes. Can be used in addition to, or as an alternative to
|
method |
One of the shown methods, which generate normalization factors for converting raw readcounts into "Spike-in normalized Reads Per Million mapped in Control" (the default), "Spike-in Normalized Read counts", or "Reads Per Million mapped". See descriptions below. |
batch_norm |
A logical indicating if batch normalization should be used
( |
ctrl_pattern |
A regular expression that matches negative control sample names. |
ctrl_names |
A character vector giving the names of the negative control
samples. Can be used as an alternative to |
field |
The metadata field in |
sample_names |
An optional character vector that can be used to rename
the samples in |
expand_ranges |
Logical indicating if ranges in |
ncores |
The number of cores to use for computations. |
A numeric vector of normalization factors for each sample in dataset.gr. Normalization factors are to be applied by multiplication.
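As a minimal sketch of what "applied by multiplication" means, the base R snippet below scales two vectors of raw readcounts by per-sample factors. The sample names, counts, and factor values are all hypothetical and chosen only for illustration; no package functions are used.

```r
# Hypothetical raw readcounts at four positions for two samples
raw_counts <- list(sampleA = c(1, 1, 0, 3),
                   sampleB = c(2, 2, 1, 5))

# Hypothetical normalization factors, one per sample (same order)
nfs <- c(sampleA = 0.5, sampleB = 0.25)

# Apply each factor to its sample by multiplication
norm_counts <- Map(function(x, nf) x * nf, raw_counts, nfs)
norm_counts
```

Each element of norm_counts is the corresponding raw vector multiplied by its factor.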
"SRPMC" is the default spike-in normalization method, as its meaning is the most portable and generalizable. Experimental Reads Per Spike-in read (RPS) are calculated for each sample, i:
RPS_i=\frac{experimental\_reads_i}{spikein\_reads_i}
RPS for each sample is divided by RPS for the negative control, which measures the change in total material vs. the negative control. This global adjustment is applied to standard RPM normalization for each sample:
NF_i=\frac{RPS_i}{RPS_{control}} \cdot \frac{1 \times 10^6}{experimental\_reads_i}
Thus, the negative control(s) are simply RPM-normalized, while the other conditions are in equivalent, directly-comparable units ("Reads Per Million mapped reads in a negative control").
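The two formulas above can be worked through by hand for one control and one treated sample. This is a sketch in base R that follows the stated equations; the variable names (exp_reads, si_reads, rps, nf) and the read totals are assumptions made for illustration, and it is not the package's internal code.

```r
# Illustrative read totals (assumed values, not real data)
exp_reads <- c(control = 2, treated = 4)  # experimental (non-spike-in) reads
si_reads  <- c(control = 2, treated = 2)  # spike-in reads

# Experimental reads per spike-in read (RPS) for each sample
rps <- exp_reads / si_reads

# Change in material vs. the control, applied to RPM normalization
nf <- (rps / rps[["control"]]) * (1e6 / exp_reads)

# Normalized experimental signal for each sample
norm_reads <- nf * exp_reads
norm_reads
```

Here the control comes out at exactly one million (it is simply RPM-normalized), while the treated sample, which has twice the experimental reads per spike-in read, comes out at two million in the same units.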
If batch_norm = TRUE (the default), all negative controls will be RPM-normalized, and the global changes in material for all other samples are calculated within each batch (vs. the negative control within the same batch).
If batch_norm = FALSE, all samples are compared to the average RPS of the negative controls. This method can only be justified if batch has less effect on RPS than other sources of variation.
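The difference between the two batch_norm settings can be sketched in base R for two batches with one control and one treated sample each. The sample names, batch labels, and read totals are hypothetical, and taking the arithmetic mean of the control RPS values is an assumption about how "average" is computed; the package's own implementation may differ in detail.

```r
# Illustrative read totals for two batches (assumed values)
exp_reads <- c(ctrl_rep1 = 2, trt_rep1 = 4, ctrl_rep2 = 2, trt_rep2 = 8)
si_reads  <- c(ctrl_rep1 = 2, trt_rep1 = 2, ctrl_rep2 = 3, trt_rep2 = 4)
batch     <- c("rep1", "rep1", "rep2", "rep2")
is_ctrl   <- c(TRUE, FALSE, TRUE, FALSE)

rps <- exp_reads / si_reads

# Like batch_norm = TRUE: compare each sample to the control in its own batch
rps_ctrl_by_batch <- setNames(rps[is_ctrl], batch[is_ctrl])
nf_batch <- (rps / rps_ctrl_by_batch[batch]) * (1e6 / exp_reads)

# Like batch_norm = FALSE: compare every sample to the mean control RPS
# (assumption: "average" is the arithmetic mean)
nf_pooled <- (rps / mean(rps[is_ctrl])) * (1e6 / exp_reads)

rbind(nf_batch, nf_pooled)
```

In the batch-wise row, both controls receive their plain RPM factor (1e6 / 2), whereas in the pooled row neither control does, because their RPS values differ from the mean.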
With method = "SNR" and batch_norm = FALSE, these normalization factors scale down the readcounts in each sample so that its spike-in read count matches that of the sample with the fewest spike-in reads:
NF_i=\frac{\min(spikein\_reads)}{spikein\_reads_i}
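A short base R sketch of this formula, using made-up spike-in totals for four hypothetical samples:

```r
# Illustrative spike-in read totals per sample (assumed values)
si_reads <- c(s1 = 2, s2 = 2, s3 = 3, s4 = 4)

# Scale every sample down to the smallest spike-in total
nf_snr <- min(si_reads) / si_reads

# After scaling, all samples have the same spike-in read count
scaled_si <- si_reads * nf_snr
scaled_si
```

The largest factor is 1 (for the samples with the fewest spike-in reads), and all others are below 1.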
If batch_norm = TRUE, such normalization factors are calculated within each batch, but a final batch (replicate) adjustment is performed that results in the negative controls having the same normalized readcounts. In this way, the negative controls are used to adjust the normalized readcounts of their entire replicate. Just as when batch_norm = FALSE, one of the normalization factors will be 1, while the rest will be <1.
One use for these normalization factors is for normalizing-by-subsampling; see subsampleBySpikeIn.
A simple convenience wrapper for calculating normalization factors for RPM normalization:
NF_i=\frac{1 \times 10^6}{experimental\_reads_i}
If spike-in reads are present, they're removed before the normalization factors are calculated.
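The RPM calculation, including removal of spike-in reads, can be sketched in base R as follows. The per-chromosome counts and the "spike" naming pattern are assumptions for illustration; this mirrors the formula rather than the package's internals.

```r
# Illustrative per-chromosome readcounts for one sample (assumed values)
counts <- c(chr1 = 2, chr2 = 2, spikechr1 = 1, spikechr2 = 1)

# Identify spike-in chromosomes by a pattern and drop them
is_spike  <- grepl("spike", names(counts))
exp_reads <- sum(counts[!is_spike])

# RPM normalization factor from the remaining experimental reads
nf_rpm <- 1e6 / exp_reads
nf_rpm
```

Only the four experimental reads enter the denominator; the two spike-in reads are ignored.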
Mike DeBerardine
getSpikeInCounts, applyNFsGRanges, subsampleBySpikeIn
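# GenomicRanges provides GRanges(), IRanges() and score();
# the package providing getSpikeInNFs() is assumed here to be BRGenomics
library(GenomicRanges)
library(BRGenomics)
#--------------------------------------------------#
# Make list of dummy GRanges
#--------------------------------------------------#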
gr1_rep1 <- GRanges(seqnames = c("chr1", "chr2", "spikechr1", "spikechr2"),
                    ranges = IRanges(start = 1:4, width = 1),
                    strand = "+")
gr2_rep2 <- gr2_rep1 <- gr1_rep2 <- gr1_rep1
# set readcounts
score(gr1_rep1) <- c(1, 1, 1, 1) # 2 exp + 2 spike = 4 total
score(gr2_rep1) <- c(2, 2, 1, 1) # 4 exp + 2 spike = 6 total
score(gr1_rep2) <- c(1, 1, 2, 1) # 2 exp + 3 spike = 5 total
score(gr2_rep2) <- c(4, 4, 2, 2) # 8 exp + 4 spike = 12 total
grl <- list(gr1_rep1, gr2_rep1,
            gr1_rep2, gr2_rep2)
names(grl) <- c("gr1_rep1", "gr2_rep1",
                "gr1_rep2", "gr2_rep2")
grl
#--------------------------------------------------#
# Get RPM NFs
#--------------------------------------------------#
# can use the names of all spike-in chromosomes
getSpikeInNFs(grl, si_names = c("spikechr1", "spikechr2"),
              method = "RPM", ncores = 1)
# or use a regular expression that matches the spike-in chromosome names
grep("spike", as.vector(seqnames(gr1_rep1)))
getSpikeInNFs(grl, si_pattern = "spike", method = "RPM", ncores = 1)
#--------------------------------------------------#
# Get simple spike-in NFs ("SNR")
#--------------------------------------------------#
# without batch normalization, NFs make all spike-in readcounts match
getSpikeInNFs(grl, si_pattern = "spike", ctrl_pattern = "gr1",
              method = "SNR", batch_norm = FALSE, ncores = 1)
# with batch normalization, controls will have the same normalized counts;
# other samples are normalized to have same spike-in reads as their matched
# control
getSpikeInNFs(grl, si_pattern = "spike", ctrl_pattern = "gr1",
              method = "SNR", batch_norm = TRUE, ncores = 1)
#--------------------------------------------------#
# Get spike-in NFs with more meaningful units ("SRPMC")
#--------------------------------------------------#
# compare to raw RPM NFs above; takes into account spike-in reads;
# units are directly comparable to the negative controls
# with batch normalization, these negative controls are the same, as they
# have the same number of non-spike-in readcounts (they're simply RPM)
getSpikeInNFs(grl, si_pattern = "spike", ctrl_pattern = "gr1", ncores = 1)
# with batch_norm = FALSE, the average reads-per-spike-in of the negative
# controls is used to calculate all NFs; unless the controls have the exact
# same ratio of non-spike-in to spike-in reads, nothing is precisely RPM
getSpikeInNFs(grl, si_pattern = "spike", ctrl_pattern = "gr1",
              batch_norm = FALSE, ncores = 1)
#--------------------------------------------------#
# Apply NFs to the GRanges
#--------------------------------------------------#
spikeInNormGRanges(grl, si_pattern = "spike", ctrl_pattern = "gr1",
                   ncores = 1)