glm_gp | R Documentation |
This function provides a simple to use interface to fit Gamma-Poisson generalized linear models. It works equally well for small scale (a single model) and large scale data (e.g. thousands of rows and columns, potentially stored on disk). The function automatically determines the appropriate size factors for each sample and efficiently finds the best overdispersion parameter for each gene.
glm_gp(
data,
design = ~1,
col_data = NULL,
reference_level = NULL,
offset = 0,
size_factors = c("normed_sum", "deconvolution", "poscounts", "ratio"),
overdispersion = TRUE,
overdispersion_shrinkage = TRUE,
ridge_penalty = 0,
do_cox_reid_adjustment = TRUE,
subsample = FALSE,
on_disk = NULL,
use_assay = NULL,
verbose = FALSE
)
data |
any matrix-like object (e.g. matrix, DelayedArray, HDF5Matrix) or
anything that can be cast to a SummarizedExperiment (e.g. |
design |
a specification of the experimental design used to fit the Gamma-Poisson GLM.
It can be a |
col_data |
a dataframe with one row for each sample in |
reference_level |
a single string that specifies which level is used as reference
when the model matrix is created. The reference level becomes the intercept and all
other coefficients are calculated with respect to the |
offset |
Constant offset in the model in addition to |
size_factors |
in large scale experiments, each sample is typically of different size
(for example different sequencing depths). A size factor is an internal mechanism of GLMs to
correct for this effect. |
overdispersion |
the simplest count model is the Poisson model. However, the Poisson model
assumes that
Note that |
overdispersion_shrinkage |
the overdispersion can be difficult to estimate with few replicates. To
improve the overdispersion estimates, we can share information across genes and shrink each individual
overdispersion estimate towards a global overdispersion estimate. Empirical studies show however that
the overdispersion varies based on the mean expression level (lower expression level => higher
dispersion). If |
ridge_penalty |
to avoid overfitting, we can penalize fits with large coefficient estimates. Instead
of directly minimizing the deviance per gene (
Default: |
do_cox_reid_adjustment |
the classical maximum likelihood estimator of the |
subsample |
the estimation of the overdispersion is the slowest step when fitting
a Gamma-Poisson GLM. For datasets with many samples, the estimation can be considerably sped up
without loosing much precision by fitting the overdispersion only on a random subset of the samples.
Default: |
on_disk |
a boolean that indicates if the dataset is loaded into memory or if it is kept on disk
to reduce the memory usage. Processing in memory can be significantly faster than on disk.
Default: |
use_assay |
Specify which assay to use. Default: |
verbose |
a boolean that indicates if information about the individual steps are printed
while fitting the GLM. Default: |
The method follows the following steps:
The size factors are estimated.
If size_factors = "normed_sum"
, the column-sum for each cell is calculated and the resulting
size factors are normalized so that their geometric mean is 1
. If size_factors = "poscounts"
,
a slightly adapted version of the procedure proposed by Anders and Huber (2010) in
equation (5) is used. To handle the large number of zeros the geometric means are calculated for
Y + 0.5
and ignored during the calculation of the median. Columns with all zeros get a
default size factor of 0.001
. If size_factors = "deconvolution"
, the method
scran::calculateSumFactors()
is called. If size_factors = "ratio"
, the unmodified procedure
from Anders and Huber (2010) in equation (5) is used.
The dispersion estimates are initialized based on the moments of each row of Y
.
The coefficients of the model are estimated.
If all samples belong to the same condition (i.e. design = ~ 1
), the betas are estimated using
a quick Newton-Raphson algorithm. This is similar to the behavior of edgeR
. For more complex
designs, the general Fisher-scoring algorithm is used. Here, the code is based on a fork of the
internal function fitBeta()
from DESeq2
. It does however contain some modification to make
the fit more robust and faster.
The mean for each gene and sample is calculate.
Note that this step can be very IO intensive if data
is or contains a DelayedArray.
The overdispersion is estimated.
The classical method for estimating the overdispersion for each gene is to maximize the
Gamma-Poisson log-likelihood by iterating over each count and summing the the corresponding
log-likelihood. It is however, much more efficient
for genes with many small counts to work on the contingency table of the counts. Originally, this
approach had already been used by Anscombe (1950). In this package, I have implemented an
extension of their method that can handle general offsets.
See also overdispersion_mle()
.
The beta coefficients are estimated once more with the updated overdispersion estimates
The mean for each gene and sample is calculated again.
This method can handle not just in memory data, but also data stored on disk. This is essential for
large scale datasets with thousands of samples, as they sometimes encountered in modern single-cell
RNA-seq analysis. glmGamPoi
relies on the DelayedArray
and beachmat
package to efficiently
implement the access to the on-disk data.
The method returns a list with the following elements:
Beta
a matrix with dimensions nrow(data) x n_coefficients
where n_coefficients
is
based on the design
argument. It contains the estimated coefficients for each gene.
overdispersions
a vector with length nrow(data)
. The overdispersion parameter for each
gene. It describes how much more the counts vary than one would expect according to the Poisson
model.
overdispersion_shrinkage_list
a list with additional information from the quasi-likelihood
shrinkage. For details see overdispersion_shrinkage()
.
deviances
a vector with the deviance of the fit for each row. The deviance is a measure how well the data is fit by the model. A smaller deviance means a better fit.
Mu
a matrix with the same dimensions as dim(data)
. If the calculation happened on
disk, than Mu
is a HDF5Matrix
. It contains the estimated mean value for each gene and
sample.
size_factors
a vector with length ncol(data)
. The size factors are the inferred
correction factors for different sizes of each sample. They are also sometimes called the
exposure factor.
Offset
a matrix with the same dimensions as dim(data)
. If the calculation happened on
disk, than Offset
is a HDF5Matrix
. It contains the log(size_factors) + offset
from the
function call.
data
a SummarizedExperiment
that contains the input counts and the col_data
model_matrix
a matrix with dimensions ncol(data) x n_coefficients
. It is build based
on the design
argument.
design_formula
the formula that used to fit the model, or NULL
otherwise
ridge_penalty
a vector with the specification of the ridge penalty.
McCarthy, D. J., Chen, Y., & Smyth, G. K. (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research, 40(10), 4288–4297. https://doi.org/10.1093/nar/gks042.
Anders Simon, & Huber Wolfgang. (2010). Differential expression analysis for sequence count data. Genome Biology. https://doi.org/10.1016/j.jcf.2018.05.006.
Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550. https://doi.org/10.1186/s13059-014-0550-8.
Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2009). edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139–140. https://doi.org/10.1093/bioinformatics/btp616.
Lun ATL, Pagès H, Smith ML (2018). “beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types.” PLoS Comput. Biol., 14(5), e1006135. doi: 10.1371/journal.pcbi.1006135..
Lund, S. P., Nettleton, D., McCarthy, D. J., & Smyth, G. K. (2012). Detecting differential expression in RNA-sequence data using quasi-likelihood with shrunken dispersion estimates. Statistical Applications in Genetics and Molecular Biology, 11(5). https://doi.org/10.1515/1544-6115.1826.
Lun ATL, Bach K and Marioni JC (2016). Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17:75 https://doi.org/10.1186/s13059-016-0947-7
overdispersion_mle()
and overdispersion_shrinkage()
for the internal functions that do the
work. For differential expression analysis, see test_de()
.
set.seed(1)
# The simplest example
y <- rnbinom(n = 10, mu = 3, size = 1/2.4)
c(glm_gp(y, size_factors = FALSE))
# Fitting a whole matrix
model_matrix <- cbind(1, rnorm(5))
true_Beta <- cbind(rnorm(n = 30), rnorm(n = 30, mean = 3))
sf <- exp(rnorm(n = 5, mean = 0.7))
model_matrix
Y <- matrix(rnbinom(n = 30 * 5, mu = sf * exp(true_Beta %*% t(model_matrix)), size = 1/2.4),
nrow = 30, ncol = 5)
fit <- glm_gp(Y, design = model_matrix, size_factors = sf, verbose = TRUE)
summary(fit)
# Fitting a model with covariates
data <- data.frame(fav_food = sample(c("apple", "banana", "cherry"), size = 50, replace = TRUE),
city = sample(c("heidelberg", "paris", "new york"), size = 50, replace = TRUE),
age = rnorm(n = 50, mean = 40, sd = 15))
Y <- matrix(rnbinom(n = 100 * 50, mu = 3, size = 1/3.1), nrow = 100, ncol = 50)
fit <- glm_gp(Y, design = ~ fav_food + city + age, col_data = data)
summary(fit)
# Specify 'ridge_penalty' to penalize extreme Beta coefficients
fit_reg <- glm_gp(Y, design = ~ fav_food + city + age, col_data = data, ridge_penalty = 1.5)
summary(fit_reg)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.