R/model_flagging.R

Defines functions flag_model_coef_count flag_model_cat_count flag_model_inj_range flag_model_mean_residual flag_model_residual

Documented in flag_model_cat_count flag_model_coef_count flag_model_inj_range flag_model_mean_residual flag_model_residual

#' @title Functions to flag/exclude models
#'
#' @description
#'
#' The functions listed on this help page (`flag_*`) allow to flag/identify
#' potentially problematic model fits. The function were designed to be used
#' with the models estimating an injection-order-dependent signal drift seen
#' in LC-MS based untargeted metabolomics data. Such models can be fit with the
#' `rowFitModel` function of the `xcms` package. Functions are expected to
#' return `TRUE` for potentially problematic model fits and `FALSE` otherwise.
#'
#' The functions are:
#'
#' - `flag_model_residual`: test whether the difference between the difference
#'   between the 25 and 75% quantile of residuals is larger than the user
#'   defined value `diff_residual`. This function identifies model fits with
#'   on average large deviations of the individual data points from the fitted
#'   line.
#'
#' - `flag_model_mean_residual`: tests if the `mean` of the absolute residuals
#'   is larger than the provided value.
#' 
#' - `flag_model_inj_range`: tests if values on which the model was fitted
#'   spans a minimum required injection index range. This requires `x` being
#'   a model of the type `y ~ inj_idx`.
#'
#' - `flag_model_cat_count`: tests if the number of replicated measurents
#'   (categories) for a categorical variable (e.g. batch) are larger than
#'   `min_count`.
#'
#' - `flag_model_coef_count`: tests if the number of estimated coefficients
#'   matches the expected number. This is useful/required for linear models
#'   aimed to adjust a batch effect, but for which a coefficient was not
#'   estimated for each batch (=level of the categorical variable representing
#'   the batch). This could happen if only missing values were present for the
#'   respective batch.
#' 
#' @param column for `flag_model_inj_range`: `character(1)` specifying the
#'     column containing the injection index.
#' 
#' @param cut_off for `flag_model_mean_residual`: `numeric(1)` defining the
#'     cut-off to flag models with large average residuals.
#'
#' @param diff_residual for `flag_model_residual`: `numeric(1)` defining the
#'     cut-off to flag models with large residuals. `TRUE` is reported for all
#'     models with a difference between the 25 and 75 percent percentile of
#'     residuals larger than this value.
#'
#' @param min_range for `flag_model_inj_range`: `numeric(1)` defining the
#'     minimum range. This is an absolute value, not a percentage. Means, the
#'     function compares the `diff(range(x$x[, column]))` with `min_range` and
#'     flags models that don't fit that criteria (i.e. have a smaller range).
#'
#' @param min_count for `flag_model_cat_count`: `integer(1)` with the minimum
#'     required number of values within each category.
#'
#' @param n_coef for `flag_model_coef_count`: `integer(1)` with the expected
#'     number of coefficients.
#' 
#' @param variable for `flag_model_cat_count`: `character(1)` with the name of
#'     the categorical value. Should be one of `colnames(x$model)`.
#'
#' @param x a linear model object such as generated by [lm()] or
#'     [robustbase::lmrob()].
#' 
#' @return `logical(1)`: `TRUE` if model fit is potentially problematic and
#'     `FALSE` otherwise or `NA` if no model provided.
#'
#' @author Johannes Rainer
#'
#' @rdname model-flagging
#'
#' @importFrom stats quantile residuals
#' @export
flag_model_residual <- function(x, diff_residual = 1) {
    if (length(x) > 1) {
        if (diff(quantile(residuals(x), probs = c(0.25, 0.75))) > diff_residual)
            TRUE
        else FALSE
    } else NA
}

#' @rdname model-flagging
#'
#' @export
flag_model_mean_residual <- function(x, cut_off = 0.5) {
    if (length(x) > 1)
        mean(abs(residuals(x))) > cut_off
    else NA
}

#' @rdname model-flagging
#'
#' @export
flag_model_inj_range <- function(x, min_range = 1, column = "inj_idx") {
    if (length(x) > 1)
        diff(range(x$model[, column])) < min_range
    else NA
}

#' @rdname model-flagging
#'
#' @export
flag_model_cat_count <- function(x, variable, min_count = 4) {
    if (missing(variable))
        stop("'variable' is missing")
    if (length(x) > 1) {
        if (!any(colnames(x$model) == variable))
            stop("'variable' has to be one of ", paste0(colnames(x$model),
                                                        collapse = ", "))
        min(table(x$model[, variable])) < min_count
    }
    else NA
}

#' @rdname model-flagging
#'
#' @export
flag_model_coef_count <- function(x, n_coef) {
  if (missing(n_coef))
    stop("'n_coef' is missing")
  if (length(x) > 1) {
   length(x$coefficients) != n_coef
  }
  else NA
}
EuracBiomedicalResearch/CompMetaboTools documentation built on Jan. 31, 2024, 1:14 p.m.