R/glm_gp.R
In glmGamPoi: Fit a Gamma-Poisson Generalized Linear Model

Documented in glm_gp

#' Fit a Gamma-Poisson Generalized Linear Model
#'
#' This function provides a simple to use interface to fit Gamma-Poisson generalized
#' linear models. It works equally well for small scale (a single model) and large scale data
#' (e.g. thousands of rows and columns, potentially stored on disk). The function
#' automatically determines the appropriate size factors for each sample and efficiently
#' finds the best overdispersion parameter for each gene.
#'
#' @param data any matrix-like object (e.g. [matrix], [DelayedArray], [HDF5Matrix]) or
#'   anything that can be cast to a [SummarizedExperiment] (e.g. `MSnSet`, `eSet` etc.) with
#'   one column per sample and row per gene.
#' @param design a specification of the experimental design used to fit the Gamma-Poisson GLM.
#'   It can be a [model.matrix()] with one row for each sample and one column for each
#'   coefficient. \cr
#'   Alternatively, `design` can be a `formula`. The entries in the
#'   formula can refer to global objects, columns in the `col_data` parameter, or the `colData(data)`
#'   of `data` if it is a `SummarizedExperiment`. \cr
#'   The third option is that `design` is a vector where each element specifies to which
#'   condition a sample belongs. \cr
#'   Default: `design = ~ 1`, which means that all samples are treated as if they belong to the
#'   same condition. Note that this is the fasted option.
#' @param col_data a dataframe with one row for each sample in `data`. Default: `NULL`.
#' @param reference_level a single string that specifies which level is used as reference
#'   when the model matrix is created. The reference level becomes the intercept and all
#'   other coefficients are calculated with respect to the `reference_level`.
#'   Default: `NULL`.
#' @param offset Constant offset in the model in addition to `log(size_factors)`. It can
#'   either be a single number, a vector of length `ncol(data)` or a matrix with the
#'   same dimensions as `dim(data)`. Note that if data is a [DelayedArray] or [HDF5Matrix],
#'   `offset` must be as well. Default: `0`.
#' @param size_factors in large scale experiments, each sample is typically of different size
#'   (for example different sequencing depths). A size factor is an internal mechanism of GLMs to
#'   correct for this effect.\cr
#'   `size_factors` is either a numeric vector with positive entries that has the same lengths as columns in the data
#'   that specifies the size factors that are used.
#'   Or it can be a string that species the method that is used to estimate the size factors
#'   (one of \code{c("normed_sum", "deconvolution", "poscounts")}).
#'   Note that \code{"normed_sum"} and \code{"poscounts"} are fairly
#'   simple methods and can lead to suboptimal results. For the best performance, I recommend to use
#'   `size_factors = "deconvolution"` which calls `scran::calculateSumFactors()`. However, you need
#'   to separately install the `scran` package from Bioconductor for this method to work.
#'   Also note that `size_factors = 1` and `size_factors = FALSE` are equivalent. If only a single gene is given,
#'   no size factor is estimated (ie. `size_factors = 1`). Default: `"normed_sum"`.
#' @param overdispersion the simplest count model is the Poisson model. However, the Poisson model
#'   assumes that \eqn{variance = mean}. For many applications this is too rigid and the Gamma-Poisson
#'   allows a more flexible mean-variance relation (\eqn{variance = mean + mean^2 * overdispersion}). \cr
#'   `overdispersion` can either be
#'   \itemize{
#'      \item a single boolean that indicates if an overdispersion is estimated for each gene.
#'      \item a numeric vector of length `nrow(data)` fixing the overdispersion to those values.
#'      \item the string `"global"` to indicate that one dispersion is fit across all genes.
#'   }
#'   Note that `overdispersion = 0` and `overdispersion = FALSE` are equivalent and both reduce
#'   the Gamma-Poisson to the classical Poisson model. Default: `TRUE`.
#' @param overdispersion_shrinkage the overdispersion can be difficult to estimate with few replicates. To
#'   improve the overdispersion estimates, we can share information across genes and shrink each individual
#'   overdispersion estimate towards a global overdispersion estimate. Empirical studies show however that
#'   the overdispersion varies based on the mean expression level (lower expression level => higher
#'   dispersion). If `overdispersion_shrinkage = TRUE`, a median trend of dispersion and expression level is
#'   fit and used to estimate the variances of a quasi Gamma Poisson model (Lund et al. 2012). Default: `TRUE`.
#' @param do_cox_reid_adjustment the classical maximum likelihood estimator of the `overdisperion` is biased
#'   towards small values. McCarthy _et al._ (2012) showed that it is preferable to optimize the Cox-Reid
#'   adjusted profile likelihood.\cr
#'   `do_cox_reid_adjustment` can be either be `TRUE` or `FALSE` to indicate if the adjustment is
#'   added during the optimization of the `overdispersion` parameter. Default: `TRUE`.
#' @param subsample the estimation of the overdispersion is the slowest step when fitting
#'   a Gamma-Poisson GLM. For datasets with many samples, the estimation can be considerably sped up
#'   without loosing much precision by fitting the overdispersion only on a random subset of the samples.
#'   Default: `FALSE` which means that the data is not subsampled. If set to `TRUE`, at most 1,000 samples
#'   are considered. Otherwise the parameter just specifies the number of samples that are considered
#'   for each gene to estimate the overdispersion.
#' @param on_disk a boolean that indicates if the dataset is loaded into memory or if it is kept on disk
#'   to reduce the memory usage. Processing in memory can be significantly faster than on disk.
#'   Default: `NULL` which means that the data is only processed in memory if `data` is an in-memory
#'   data structure.
#' @param verbose a boolean that indicates if information about the individual steps are printed
#'   while fitting the GLM. Default: `FALSE`.
#'
#'
#' @details
#' The method follows the following steps:
#'
#' 1. The size factors are estimated.\cr
#'    If `size_factors = "normed_sum"`, the column-sum for each cell is calculated and the resulting
#'    size factors are normalized so that their geometric mean is `1`. If `size_factors = "poscounts"`,
#'    a slightly adapted version of the procedure proposed by Anders and Huber (2010) in
#'    equation (5) is used. To handle the large number of zeros the geometric means are calculated for
#'    \eqn{Y + 0.5} and ignored during the calculation of the median. Columns with all zeros get a
#'    default size factor of \eqn{0.001}. If `size_factors = "deconvolution"`, the method
#'    `scran::calculateSumFactors()` is called.
#' 2. The dispersion estimates are initialized based on the moments of each row of \eqn{Y}.
#' 3. The coefficients of the model are estimated.\cr
#'    If all samples belong to the same condition (i.e. `design = ~ 1`), the betas are estimated using
#'    a quick Newton-Raphson algorithm. This is similar to the behavior of `edgeR`. For more complex
#'    designs, the general Fisher-scoring algorithm is used. Here, the code is based on a fork  of the
#'    internal function `fitBeta()` from `DESeq2`. It does however contain some modification to make
#'    the fit more robust and faster.
#' 4. The mean for each gene and sample is calculate.\cr
#'    Note that this step can be very IO intensive if `data` is or contains a DelayedArray.
#' 5. The overdispersion is estimated.\cr
#'    The classical method for estimating the overdispersion for each gene is to maximize the
#'    Gamma-Poisson log-likelihood by iterating over each count and summing the the corresponding
#'    log-likelihood. It is however, much more efficient
#'    for genes with many small counts to work on the contingency table of the counts. Originally, this
#'    approach had already been used by Anscombe (1950). In this package, I have implemented an
#'    extension of their method that can handle general offsets.\cr
#'    See also [overdispersion_mle()].
#'  6. The beta coefficients are estimated once more with the updated overdispersion estimates
#'  7. The mean for each gene and sample is calculated again.
#'
#' This method can handle not just in memory data, but also data stored on disk. This is essential for
#' large scale datasets with thousands of samples, as they sometimes encountered in modern single-cell
#' RNA-seq analysis. `glmGamPoi` relies on the `DelayedArray` and `beachmat` package to efficiently
#' implement the access to the on-disk data.
#'
#' @return
#' The method returns a list with the following elements:
#' \describe{
#'   \item{`Beta`}{a matrix with dimensions `nrow(data) x n_coefficients` where `n_coefficients` is
#'   based on the `design` argument. It contains the estimated coefficients for each gene.}
#'   \item{`overdispersions`}{a vector with length `nrow(data)`. The overdispersion parameter for each
#'   gene. It describes how much more the counts vary than one would expect according to the Poisson
#'   model.}
#'   \item{`overdispersion_shrinkage_list`}{a list with additional information from the quasi-likelihood
#'   shrinkage. For details see [overdispersion_shrinkage()].}
#'   \item{`deviances`}{a vector with the deviance of the fit for each row. The deviance is a measure
#'   how well the data is fit by the model. A smaller deviance means a better fit.}
#'   \item{`Mu`}{a matrix with the same dimensions as `dim(data)`. If the calculation happened on
#'   disk, than `Mu` is a `HDF5Matrix`. It contains the estimated mean value for each gene and
#'   sample.}
#'   \item{`size_factors`}{a vector with length `ncol(data)`. The size factors are the inferred
#'   correction factors for different sizes of each sample. They are also sometimes called the
#'   exposure factor.}
#'   \item{`data`}{a `SummarizedExperiment` that contains the input counts and the `col_data`}
#'   \item{`model_matrix`}{a matrix with dimensions `ncol(data) x n_coefficients`. It is build based
#'   on the `design` argument.}
#' }
#'
#' @examples
#'  set.seed(1)
#'  # The simplest example
#'  y <- rnbinom(n = 10, mu = 3, size = 1/2.4)
#'  c(glm_gp(y, size_factors = FALSE))
#'
#'  # Fitting a whole matrix
#'  model_matrix <- cbind(1, rnorm(5))
#'  true_Beta <- cbind(rnorm(n = 30), rnorm(n = 30, mean = 3))
#'  sf <- exp(rnorm(n = 5, mean = 0.7))
#'  model_matrix
#'  Y <- matrix(rnbinom(n = 30 * 5, mu = sf * exp(true_Beta %*% t(model_matrix)), size = 1/2.4),
#'              nrow = 30, ncol = 5)
#'
#'  fit <- glm_gp(Y, design = model_matrix, size_factors = sf, verbose = TRUE)
#'  summary(fit)
#'
#'  # Fitting a model with covariates
#'  data <- data.frame(fav_food = sample(c("apple", "banana", "cherry"), size = 50, replace = TRUE),
#'  city = sample(c("heidelberg", "paris", "new york"), size = 50, replace = TRUE),
#'  age = rnorm(n = 50, mean = 40, sd = 15))
#'  Y <- matrix(rnbinom(n = 100 * 50, mu = 3, size = 1/3.1), nrow = 100, ncol = 50)
#'  fit <- glm_gp(Y, design = ~ fav_food + city + age, col_data = data)
#'  summary(fit)
#'
#'
#'
#' @seealso [overdispersion_mle()] and [overdispersion_shrinkage()] for the internal functions that do the
#'   work. For differential expression analysis, see [test_de()].
#'
#' @references
#'   * McCarthy, D. J., Chen, Y., & Smyth, G. K. (2012). Differential expression analysis of multifactor
#'   RNA-Seq experiments with respect to biological variation. Nucleic Acids Research, 40(10), 4288–4297.
#'   [https://doi.org/10.1093/nar/gks042](https://doi.org/10.1093/nar/gks042).
#'   * Anders Simon, & Huber Wolfgang. (2010). Differential expression analysis for sequence count data.
#'   Genome Biology. [https://doi.org/10.1016/j.jcf.2018.05.006](https://doi.org/10.1016/j.jcf.2018.05.006).
#'   * Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and  dispersion
#'   for RNA-seq data with DESeq2. Genome Biology, 15(12), 550.
#'   [https://doi.org/10.1186/s13059-014-0550-8](https://doi.org/10.1186/s13059-014-0550-8).
#'   * Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2009). edgeR: A Bioconductor package for differential
#'   expression analysis of digital gene expression data. Bioinformatics, 26(1), 139–140.
#'   [https://doi.org/10.1093/bioinformatics/btp616](https://doi.org/10.1093/bioinformatics/btp616).
#'   * Lun ATL, Pagès H, Smith ML (2018). “beachmat: A Bioconductor C++ API for accessing high-throughput
#'   biological data from a variety of R matrix types.” PLoS Comput. Biol., 14(5), e1006135. doi:
#'   [10.1371/journal.pcbi.1006135.](https://doi.org/10.1371/journal.pcbi.1006135).
#'   * Lund, S. P., Nettleton, D., McCarthy, D. J., & Smyth, G. K. (2012). Detecting differential expression
#'   in RNA-sequence data using quasi-likelihood with shrunken dispersion estimates. Statistical
#'   Applications in Genetics and Molecular Biology, 11(5).
#'   [https://doi.org/10.1515/1544-6115.1826](https://doi.org/10.1515/1544-6115.1826).
#'   * Lun ATL, Bach K and Marioni JC (2016). Pooling across cells to normalize single-cell RNA sequencing
#'   data with many zero counts. Genome Biol. 17:75
#'   [https://doi.org/10.1186/s13059-016-0947-7](https://doi.org/10.1186/s13059-016-0947-7)
#'
#' @export
glm_gp <- function(data,
                   design = ~ 1,
                   col_data = NULL,
                   reference_level = NULL,
                   offset = 0,
                   size_factors = c("normed_sum", "deconvolution", "poscounts"),
                   overdispersion = TRUE,
                   overdispersion_shrinkage = TRUE,
                   do_cox_reid_adjustment = TRUE,
                   subsample = FALSE,
                   on_disk = NULL,
                   verbose = FALSE){

  # Validate `data`
  if(inherits(data, "formula")){
    if(length(design) != 2 || design != ~ 1){
      stop("If the first argument is already a formula, the second argument must not be set. Please call this function like this:\n",
           "'glm_gp(data = mat, design = ~ a + b + c, ...)'", call. = FALSE)
    }
    extr <- extract_data_from_formula(data, col_data, parent.frame())
    data <- extr$data
    design <- extr$design
  }
  if(is.vector(data)){
    data <- matrix(data, nrow = 1)
  }
  data_mat <- handle_data_parameter(data, on_disk)

  # Convert the formula to a model_matrix
  col_data <- get_col_data(data, col_data)
  des <- handle_design_parameter(design, data, col_data, reference_level)

  # Call glm_gp_impl()
  res <- glm_gp_impl(data_mat,
              model_matrix = des$model_matrix,
              offset = offset,
              size_factors = size_factors,
              overdispersion = overdispersion,
              overdispersion_shrinkage = overdispersion_shrinkage,
              do_cox_reid_adjustment = do_cox_reid_adjustment,
              subsample = subsample,
              verbose = verbose)
  # Make sure that the output is nice and beautiful
  rownames(data_mat) <- rownames(data)
  colnames(data_mat) <- colnames(data)
  res$data <- SummarizedExperiment::SummarizedExperiment(list(counts = data_mat),
                                                         colData = col_data)
  res$model_matrix <- des$model_matrix
  if(is.null(colnames(res$model_matrix))){
    colnames(res$model_matrix) <- paste0("Coef_", seq_len(ncol(res$model_matrix)))
  }
  res$design_formula <- des$design_formula
  colnames(res$Beta) <- colnames(res$model_matrix)
  rownames(res$Beta) <- rownames(data)
  rownames(res$Mu) <- rownames(data)
  colnames(res$Mu) <- colnames(data)
  names(res$overdispersions) <- rownames(data)
  names(res$deviances) <- rownames(data)
  names(res$size_factors) <- colnames(data)

  class(res) <- "glmGamPoi"
  res
}


handle_data_parameter <- function(data, on_disk){
  if(is.matrix(data)){
    if(! is.numeric(data)){
      stop("The data argument must consist of numeric values and not of ", mode(data), " values")
    }
    if(is.null(on_disk) || isFALSE(on_disk)){
      data_mat <- data
    }else if(isTRUE(on_disk)){
      data_mat <- HDF5Array::writeHDF5Array(data)
    }else{
      stop("Illegal argument type for on_disk. Can only handle 'NULL', 'TRUE', or 'FALSE'")
    }
  }else if(is(data, "DelayedArray")){
    if(is.null(on_disk) || isTRUE(on_disk)){
      data_mat <- data
    }else if(isFALSE(on_disk)){
      data_mat <- as.matrix(data)
    }else{
      stop("Illegal argument type for on_disk. Can only handle 'NULL', 'TRUE', or 'FALSE'")
    }
  }else if(is(data, "SummarizedExperiment")){
    data_mat <- handle_data_parameter(SummarizedExperiment::assay(data), on_disk)
  }else if(canCoerce(data, "SummarizedExperiment")){
    se <- as(data, "SummarizedExperiment")
    data_mat <- handle_data_parameter(SummarizedExperiment::assay(se), on_disk)
  }else if(is(data, "dgCMatrix")) {
    stop("glmGamPoi does not yet support sparse input data of type 'dgCMatrix'. ",
         "If your data fits into memory, please cast it to a dense matrix with ",
         "'as.matrix(data)' or if it is too big, convert it to an on-disk dataset ",
         "with the 'HDF5Array' package. ")
  }else{
    stop("Cannot handle data of class ", class(data), ".",
         "It must be of type matrix, DelayedArray, SummarizedExperiment, ",
         "or something that can be cast to a SummarizedExperiment")
  }
  data_mat
}


get_col_data <- function(data, col_data){
  if(is.null(col_data)){
    col_data <- as.data.frame(matrix(numeric(0), nrow=ncol(data)))
  }
  if(is(data, "SummarizedExperiment")){
    cbind(col_data, SummarizedExperiment::colData(data))
  }else{
    col_data
  }
}


handle_design_parameter <- function(design, data, col_data, reference_level){
  n_samples <- ncol(data)

  # Handle the design parameter
  if(is.matrix(design)){
    if(! is.null(reference_level)){
      stop("Cannot specify `design` as a matrix and `reference_level` != NULL.\n",
           "Either provide `design` as a formula and specify `reference_level` or ",
           "make sure that the correct reference level is used when the matrix is created ",
           "for example with `stats::relevel()`.")
    }
    model_matrix <- design
    design_formula <- NULL
  }else if((is.vector(design) || is.factor(design))){
    if(length(design) != n_samples){
      if(length(design) == 1 && design == 1){
        stop("The specified design vector length (", length(design), ") does not match ",
             "the number of samples: ", n_samples, "\n",
             "Did you maybe mean: `design = ~ 1`?")
      }else{
        stop("The specified design vector length (", length(design), ") does not match ",
             "the number of samples: ", n_samples)
      }
    }
    model_matrix <- convert_chr_vec_to_model_matrix(design, reference_level)
    design_formula <- NULL
  }else if(inherits(design,"formula")){
    model_matrix <- convert_formula_to_model_matrix(design, col_data, reference_level)
    design_formula <- design
  }else{
    stop(paste0("design argment of class ", class(design), " is not supported. Please ",
                "specify a `model_matrix`, a `character vector`, or a `formula`."))
  }


  if(nrow(model_matrix) != ncol(data)) stop("Number of rows in col_data does not match number of columns of data")
  if(! is.null(rownames(model_matrix)) &&
     ! all(rownames(model_matrix) == as.character(seq_len(nrow(model_matrix)))) && # That's the default rownames
     ! is.null(colnames(data))){
    if(! all(rownames(model_matrix) == colnames(data))){
      if(setequal(rownames(model_matrix), colnames(data))){
        # Rearrange the rows to match the columns of data
        model_matrix <- model_matrix[colnames(data), ,drop=FALSE]
      }else{
        stop("The rownames of the model_matrix / col_data do not match the column names of data.")
      }
    }

  }

  if(ncol(model_matrix) >= ncol(data)){
    stop("The model_matrix:\n", format_matrix(head(model_matrix)), "\nhas more columns (", ncol(model_matrix),
         ") than the there are samples in the data matrix\n",
         format_matrix(head(data[,seq_len(min(5, ncol(data))),drop=FALSE])),
         "\nwhich has ", ncol(data), " columns. ",
         "Too few replicates / too many coefficients to fit model.")
  }

  # Check rank of model_matrix
  qr_mm <- qr(model_matrix)
  if(qr_mm$rank < ncol(model_matrix) && n_samples > 0){
    is_zero_column <- DelayedMatrixStats::colCounts(model_matrix, value = 0) == nrow(model_matrix)
    if(any(is_zero_column)){
      stop("The model matrix:\n", format_matrix(head(model_matrix)), "\nseems degenerate ('matrix_rank(model_matrix) < ncol(model_matrix)'). ",
           "Column ", paste0(which(is_zero_column), collapse = ", "), " contains only zeros.")
    }else{
      stop("The model matrix:\n", format_matrix(head(model_matrix)), "\nseems degenerate ('matrix_rank(model_matrix) < ncol(model_matrix)'). ",
           "Some columns are perfectly collinear. Did you maybe include the same coefficient twice?")
    }
  }

  rownames(model_matrix) <- colnames(data)
  validate_model_matrix(model_matrix, data)
  list(model_matrix = model_matrix, design_formula = design_formula)
}



handle_subsample_parameter <- function(data, subsample){
  if(isFALSE(subsample)){
    n_subsamples <- ncol(data)
  }else if(isTRUE(subsample)){
    n_subsamples <- 1000
  }else{
    n_subsamples <- subsample
  }
  min(n_subsamples, ncol(data))
}




validate_model_matrix <- function(matrix, data){
  stopifnot(is.matrix(matrix))
  stopifnot(nrow(matrix) == ncol(data))
}



convert_chr_vec_to_model_matrix <- function(design, reference_level){
  if(! is.factor(design)){
    design_fct <- as.factor(design)
  }else{
    design_fct <- design
  }

  if(length(levels(design_fct)) == 1){
    # All entries are identical build an intercept only model
    mm <- matrix(1, nrow=length(design_fct), ncol=1)
    colnames(mm) <- levels(design_fct)
  }else if(is.null(reference_level)){
    helper_df <- data.frame(x_ = design_fct)
    mm <- stats::model.matrix.default(~ x_ - 1, helper_df)
    colnames(mm) <- sub("^x_", "", colnames(mm))
  }else{
    design_fct <- stats::relevel(design_fct, ref = reference_level)
    helper_df <- data.frame(x_ = design_fct)
    mm <- stats::model.matrix.default(~ x_ + 1, helper_df)
    colnames(mm)[-1] <- paste0(sub("^x_", "", colnames(mm)[-1]), "_vs_", reference_level)
  }
  colnames(mm)[colnames(mm) == "(Intercept)"] <- "Intercept"
  mm
}

extract_data_from_formula <- function(formula, col_data, encl = parent.frame()){
  if(length(formula) < 3){
    stop("The formula does not have a left hand side. Please call this function like this:\n",
         "'glm_gp(data = mat, design = ~ a + b + c, ...)'", call. = FALSE)
  }
  data <- eval(formula[[2]], envir = col_data, enclos = encl)
  formula[[2]] <- NULL
  list(data = data, design = formula)
}


convert_formula_to_model_matrix <- function(formula, col_data, reference_level=NULL){
  if(! is.null(reference_level)){
    has_ref_level <- vapply(col_data, function(x){
      any(!is.na(x) & x == reference_level)
    }, FUN.VALUE = FALSE)
    if(all(has_ref_level == FALSE)){
      stop("None of the columns contains the specified reference_level.")
    }
    col_data[has_ref_level] <- lapply(col_data[has_ref_level], function(col){
      if(is.character(col)){
        col <- as.factor(col)
      }
      stats::relevel(col, ref = reference_level)
    })
  }
  tryCatch({
    mm <- stats::model.matrix.default(formula, col_data)
  }, error = function(e){
    # Try to extract text from error message
    match <- regmatches(e$message, regexec("object '(.+)' not found", e$message))[[1]]
    if(length(match) == 2){
      stop("Object '", match[2], "' not found. Allowed variables are:\n",
           paste0(colnames(col_data), collapse = ", "))
    }else{
      stop(e$message)
    }
  })

  colnames(mm)[colnames(mm) == "(Intercept)"] <- "Intercept"
  mm
}


is_on_disk.glmGamPoi <- function(fit){
  is(fit$Mu, "DelayedMatrix") && is(DelayedArray::seed(fit$Mu), "HDF5ArraySeed")
}