getSpikeInNFs: Calculating spike-in normalization factors
In mdeber/BRGenomics: Tools for the Efficient Analysis of High-Resolution Genomics Data

getSpikeInNFs

R Documentation

Calculating spike-in normalization factors

Description

Use getSpikeInNFs to obtain the spike-in normalization factors, or spikeInNormGRanges to return the input GRanges objects with their readcounts spike-in normalized.

Usage

getSpikeInNFs(
  dataset.gr,
  si_pattern = NULL,
  si_names = NULL,
  method = c("SRPMC", "SNR", "RPM"),
  batch_norm = TRUE,
  ctrl_pattern = NULL,
  ctrl_names = NULL,
  field = "score",
  sample_names = NULL,
  expand_ranges = FALSE,
  ncores = getOption("mc.cores", 2L)
)

spikeInNormGRanges(
  dataset.gr,
  si_pattern = NULL,
  si_names = NULL,
  method = c("SRPMC", "SNR", "RPM"),
  batch_norm = TRUE,
  ctrl_pattern = NULL,
  ctrl_names = NULL,
  field = "score",
  sample_names = NULL,
  expand_ranges = FALSE,
  ncores = getOption("mc.cores", 2L)
)

Arguments

`dataset.gr`	A GRanges object, or (more typically) a list of GRanges objects.
`si_pattern`	A regular expression that matches spike-in chromosomes. Can be used in addition to, or as an alternative to `si_names`.
`si_names`	A character vector giving the names of the spike-in chromosomes. Can be used in addition to, or as an alternative to `si_pattern`.
`method`	One of the shown methods, which generate normalization factors for converting raw readcounts into "Spike-in normalized Reads Per Million mapped in Control" (the default), "Spike-in Normalized Read counts", or "Reads Per Million mapped". See descriptions below.
`batch_norm`	A logical indicating if batch normalization should be used (`TRUE` by default). See descriptions below. If batch normalization is used, sample names must end with "rep#", wherein "#" is one or more characters (usually a number) giving the replicate. If this is not the case, users can use the `sample_names` argument to make the names conform.
`ctrl_pattern`	A regular expression that matches negative control sample names.
`ctrl_names`	A character vector giving the names of the negative control samples. Can be used as an alternative to `ctrl_pattern`.
`field`	The metadata field in `dataset.gr` that contains raw readcounts. If each range is an individual read, set `field = NULL`.
`sample_names`	An optional character vector that can be used to rename the samples in `dataset.gr`. Intended use is if `dataset.gr` is an unnamed list, or if `batch_norm = TRUE` but the sample names don't conform to the required naming scheme.
`expand_ranges`	Logical indicating if ranges in `dataset.gr` should be treated as descriptions of single molecules (`FALSE`), or if ranges should be treated as representing multiple adjacent positions with the same signal (`TRUE`). See `getCountsByRegions`.
`ncores`	The number of cores to use for computations.

Value

A numeric vector of normalization factors for each sample in dataset.gr. Normalization factors are to be applied by multiplication.

Spike-in normalized Reads Per Million mapped in Control (SRPMC)

This is the default spike-in normalization method, as its meaning is the most portable and generalizable. Experimental Reads Per Spike-in read (RPS) are calculated for each sample, i:

RPS_i=\frac{experimental\_reads_i}{ spikein\_reads_i}

RPS for each sample is divided by RPS for the negative control, which measures the change in total material vs. the negative control. This global adjustment is applied to standard RPM normalization for each sample:

NF_i=\frac{RPS_i}{RPS_{control}} \cdot \frac{1 x 10^6}{experimental\_reads_i}

Thus, the negative control(s) are simply RPM-normalized, while the other conditions are in equivalent, directly-comparable units ("Reads Per Million mapped reads in a negative control").

If batch_norm = TRUE (the default), all negative controls will be RPM-normalized, and the global changes in material for all other samples are calculated within each batch (vs. the negative control within the same batch).

If batch_norm = FALSE, all samples are compared to the average RPS of the negative controls. This method can only be justified if batch has less effect on RPS than other sources of variation.

Spike-in Normalized Reads (SNR)

If batch_norm = FALSE, these normalization factors act to scale down the readcounts in each sample to make the spike-in read counts match the sample with the lowest number of spike-in reads:

NF_i=\frac{min(spikein\_reads)}{spikein\_reads_i}

If batch_norm = TRUE, such normalization factors are calculated within each batch, but a final batch (replicate) adjustment is performed that results in the negative controls having the same normalized readcounts. In this way, the negative controls are used to adjust the normalized readcounts of their entire replicate. Just as when batch_norm = FALSE, one of the normalization factors will be 1, while the rest will be <1.

One use for these normalization factors is for normalizing-by-subsampling; see subsampleBySpikeIn.

Reads Per Million mapped reads (RPM)

A simple convenience wrapper for calculating normalization factors for RPM normalization:

NF_i=\frac{1 x 10^6}{experimental\_reads_i}

If spike-in reads are present, they're removed before the normalization factors are calculated.

Author(s)

Mike DeBerardine

Examples

#--------------------------------------------------#
# Make list of dummy GRanges
#--------------------------------------------------#
gr1_rep1 <- GRanges(seqnames = c("chr1", "chr2", "spikechr1", "spikechr2"),
                    ranges = IRanges(start = 1:4, width = 1),
                    strand = "+")
gr2_rep2 <- gr2_rep1 <- gr1_rep2 <- gr1_rep1

# set readcounts
score(gr1_rep1) <- c(1, 1, 1, 1) # 2 exp + 2 spike = 4 total
score(gr2_rep1) <- c(2, 2, 1, 1) # 4 exp + 2 spike = 6 total
score(gr1_rep2) <- c(1, 1, 2, 1) # 2 exp + 3 spike = 5 total
score(gr2_rep2) <- c(4, 4, 2, 2) # 8 exp + 4 spike = 12 total

grl <- list(gr1_rep1, gr2_rep1,
            gr1_rep2, gr2_rep2)
names(grl) <- c("gr1_rep1", "gr2_rep1",
                "gr1_rep2", "gr2_rep2")

grl

#--------------------------------------------------#
# Get RPM NFs
#--------------------------------------------------#

# can use the names of all spike-in chromosomes
getSpikeInNFs(grl, si_names = c("spikechr1", "spikechr2"),
              method = "RPM", ncores = 1)

# or use a regular expression that matches the spike-in chromosome names
grep("spike", as.vector(seqnames(gr1_rep1)))

getSpikeInNFs(grl, si_pattern = "spike", method = "RPM", ncores = 1)


#--------------------------------------------------#
# Get simple spike-in NFs ("SNR")
#--------------------------------------------------#

# without batch normalization, NFs make all spike-in readcounts match
getSpikeInNFs(grl, si_pattern = "spike", ctrl_pattern = "gr1",
              method = "SNR", batch_norm = FALSE, ncores = 1)

# with batch normalization, controls will have the same normalized counts;
# other samples are normalized to have same spike-in reads as their matched
# control
getSpikeInNFs(grl, si_pattern = "spike", ctrl_pattern = "gr1",
              method = "SNR", batch_norm = TRUE, ncores = 1)

#--------------------------------------------------#
# Get spike-in NFs with more meaningful units ("RPMC")
#--------------------------------------------------#

# compare to raw RPM NFs above; takes into account spike-in reads;
# units are directly comparable to the negative controls

# with batch normalization, these negative controls are the same, as they
# have the same number of non-spike-in readcounts (they're simply RPM)
getSpikeInNFs(grl, si_pattern = "spike", ctrl_pattern = "gr1", ncores = 1)

# batch_norm = FALSE, the average reads-per-spike-in for the negative
# controls are used to calculate all NFs; unless the controls have the exact
# same ratio of non-spike-in to spike-in reads, nothing is precisely RPM
getSpikeInNFs(grl, si_pattern = "spike", ctrl_pattern = "gr1",
              batch_norm = FALSE, ncores = 1)

#--------------------------------------------------#
# Apply NFs to the GRanges
#--------------------------------------------------#

spikeInNormGRanges(grl, si_pattern = "spike", ctrl_pattern = "gr1",
                   ncores = 1)

mdeber/BRGenomics documentation built on Aug. 3, 2024, 3:43 a.m.