#' @title Analysis of Composition of Microbiomes (ANCOM)
#'
#' @description Determine taxa whose absolute abundances, per unit volume, of
#' the ecosystem (e.g. gut) are significantly different with changes in the
#' covariate of interest (e.g. group). The current version of
#' \code{ancom} function implements ANCOM in cross-sectional and repeated
#' measurements data while allowing for covariate adjustment.
#'
#' @details A taxon is considered to have structural zeros in some (>=1)
#' groups if it is completely (or nearly completely) missing in these groups.
#' For instance, suppose there are three groups: g1, g2, and g3.
#' If the counts of taxon A in g1 are 0 but nonzero in g2 and g3,
#' then taxon A will be considered to contain structural zeros in g1.
#' In this example, taxon A is declared to be differentially abundant between
#' g1 and g2, g1 and g3, and consequently, it is globally differentially
#' abundant with respect to this group variable.
#' Such taxa are not further analyzed using ANCOM, but the results are
#' summarized in the overall summary. For more details about the structural
#' zeros, please go to the
#' \href{https://doi.org/10.3389/fmicb.2017.02114}{ANCOM-II} paper.
#' Setting \code{neg_lb = TRUE} indicates that you are using both criteria
#' stated in section 3.2 of
#' \href{https://doi.org/10.3389/fmicb.2017.02114}{ANCOM-II}
#' to detect structural zeros; otherwise, the algorithm will only use the
#' equation 1 in section 3.2 for declaring structural zeros. Generally, it is
#' recommended to set \code{neg_lb = TRUE} when the sample size per group is
#' relatively large (e.g. > 30).
#'
#' @param data the input data. The \code{data} parameter should be either a
#' \code{matrix}, \code{data.frame}, \code{phyloseq} or a \code{TreeSummarizedExperiment}
#' object. Both \code{phyloseq} and \code{TreeSummarizedExperiment} objects
#' consist of a feature table (microbial count table), a sample metadata table,
#' a taxonomy table (optional), and a phylogenetic tree (optional).
#' If a \code{matrix} or \code{data.frame} is provided, ensure that the row
#' names of the \code{metadata} match the sample names (column names if
#' \code{taxa_are_rows} is TRUE, and row names otherwise) in \code{data}.
#' if a \code{phyloseq} or a \code{TreeSummarizedExperiment} is used, this
#' standard has already been enforced. For detailed information, refer to
#' \code{?phyloseq::phyloseq} or
#' \code{?TreeSummarizedExperiment::TreeSummarizedExperiment}.
#' It is recommended to use low taxonomic levels, such as OTU or species level,
#' as the estimation of sampling fractions requires a large number of taxa.
#' @param taxa_are_rows logical. Whether taxa are positioned in the rows of the
#' feature table. Default is TRUE.
#' It is recommended to use low taxonomic levels, such as OTU or species level,
#' as the estimation of sampling fractions requires a large number of taxa.
#' @param assay_name character. Name of the count table in the data object
#' (only applicable if data object is a \code{(Tree)SummarizedExperiment}).
#' Default is "counts".
#' See \code{?SummarizedExperiment::assay} for more details.
#' @param assay.type alias for \code{assay_name}.
#' @param tax_level character. The taxonomic or non taxonomic(rowData) level of interest. The input data
#' can be analyzed at any taxonomic or rowData level without prior agglomeration.
#' Note that \code{tax_level} must be a value from \code{taxonomyRanks} or \code{rowData}, which
#' includes "Kingdom", "Phylum" "Class", "Order", "Family" "Genus" "Species" etc.
#' See \code{?mia::taxonomyRanks} for more details.
#' Default is NULL, i.e., do not perform agglomeration, and the
#' ANCOM-BC2 analysis will be performed at the lowest taxonomic level of the
#' input \code{data}.
#' @param rank alias for \code{tax_level}.
#' @param aggregate_data The abundance data that has been aggregated to the desired
#' taxonomic level. This parameter is required only when the input data is in
#' \code{matrix} or \code{data.frame} format. For \code{phyloseq} or \code{TreeSummarizedExperiment}
#' data, aggregation is performed by specifying the \code{tax_level} parameter.
#' @param meta_data a \code{data.frame} containing sample metadata.
#' This parameter is mandatory when the input \code{data} is a generic
#' \code{data.frame}. Ensure that the row names of the \code{metadata} match the
#' sample names (column names if \code{taxa_are_rows} is TRUE, and row names
#' otherwise) in \code{data}.
#' @param p_adj_method character. method to adjust p-values. Default is "holm".
#' Options include "holm", "hochberg", "hommel", "bonferroni", "BH", "BY",
#' "fdr", "none". See \code{?stats::p.adjust} for more details.
#' @param prv_cut a numerical fraction between 0 and 1. Taxa with prevalences
#' (the proportion of samples in which the taxon is present)
#' less than \code{prv_cut} will be excluded in the analysis. For example,
#' if there are 100 samples, and a taxon has nonzero counts present in less than
#' 100*prv_cut samples, it will not be considered in the analysis.
#' Default is 0.10.
#' @param lib_cut a numerical threshold for filtering samples based on library
#' sizes. Samples with library sizes less than \code{lib_cut} will be
#' excluded in the analysis. Default is 0, i.e. do not discard any sample.
#' @param main_var character. The name of the main variable of interest.
#' @param adj_formula character string representing the formula for
#' covariate adjustment. Please note that you should NOT include the
#' \code{main_var} in the formula. Default is \code{NULL}.
#' @param rand_formula the character string expresses how the microbial absolute
#' abundances for each taxon depend on the random effects in metadata. ANCOM
#' follows the \code{lmerTest} package in formulating the random effects. See
#' \code{?lmerTest::lmer} for more details. Default is \code{NULL}.
#' @param lme_control a list of control parameters for mixed model fitting.
#' See \code{?lme4::lmerControl} for details.
#' @param struc_zero logical. whether to detect structural zeros based on
#' \code{main_var}. \code{main_var} should be discrete. Default is FALSE.
#' @param neg_lb logical. whether to classify a taxon as a structural zero using
#' its asymptotic lower bound. Default is FALSE.
#' @param alpha numeric. level of significance. Default is 0.05.
#' @param n_cl numeric. The number of nodes to be forked. For details, see
#' \code{?parallel::makeCluster}. Default is 1 (no parallel computing).
#' @param verbose logical. Whether to display detailed progress messages.
#'
#' @return a \code{list} with components:
#' \itemize{
#' \item{ \code{res}, a \code{data.frame} containing ANCOM
#' result for the variable specified in \code{main_var},
#' each column is:}
#' \itemize{
#' \item{ \code{W}, test statistics.}
#' \item{ \code{detected_0.9, detected_0.8, detected_0.7, detected_0.6},
#' logical vectors representing whether a taxon is differentially
#' abundant under a series of cutoffs. For example, TRUE in
#' \code{detected_0.7} means the number of ALR transformed models where
#' the taxon is differentially abundant with regard to the main variable
#' outnumbers \code{0.7 * (n_tax - 1)}. \code{detected_0.7} is commonly
#' used. Choose \code{detected_0.8} or \code{detected_0.9} for more
#' conservative results, or choose \code{detected_0.6} for more liberal
#' results.}
#' }
#' \item{ \code{zero_ind}, a logical \code{data.frame} with TRUE
#' indicating the taxon is detected to contain structural zeros in
#' some specific groups.}
#' \item{ \code{beta_data}, a numeric \code{matrix} containing pairwise
#' coefficients for the main variable of interest in ALR transformed
#' regression models.}
#' \item{ \code{p_data}, a numeric \code{matrix} containing pairwise
#' p-values for the main variable of interest in ALR transformed
#' regression models.}
#' \item{ \code{q_data}, a numeric \code{matrix} containing adjusted
#' p-values by applying the \code{p_adj_method} to the \code{p_data}
#' matrix.}
#' }
#'
#' @seealso \code{\link{ancombc}} \code{\link{ancombc2}}
#'
#' @examples
#' library(ANCOMBC)
#' if (requireNamespace("microbiome", quietly = TRUE)) {
#' data(atlas1006, package = "microbiome")
#' # subset to baseline
#' pseq = phyloseq::subset_samples(atlas1006, time == 0)
#'
#' # run ancom function
#' set.seed(123)
#' out = ancom(data = pseq, tax_level = "Family",
#' p_adj_method = "holm", prv_cut = 0.10, lib_cut = 1000,
#' main_var = "bmi_group", adj_formula = "age + nationality",
#' rand_formula = NULL, lme_control = NULL,
#' struc_zero = TRUE, neg_lb = TRUE, alpha = 0.05, n_cl = 1)
#'
#' res = out$res
#' } else {
#' message("The 'microbiome' package is not installed. Please install it to use this example.")
#' }
#'
#' @author Huang Lin
#'
#' @references
#' \insertRef{mandal2015analysis}{ANCOMBC}
#'
#' \insertRef{kaul2017analysis}{ANCOMBC}
#'
#' @rawNamespace import(stats, except = filter)
#' @importFrom lmerTest lmer
#' @importFrom lme4 lmerControl
#' @importFrom parallel makeCluster stopCluster
#' @importFrom foreach foreach %dopar% registerDoSEQ
#' @importFrom doParallel registerDoParallel
#' @importFrom Rdpack reprompt
#'
#' @export
ancom = function(data = NULL, taxa_are_rows = TRUE,
assay.type = NULL, assay_name = "counts",
rank = NULL, tax_level = NULL,
aggregate_data = NULL, meta_data = NULL,
p_adj_method = "holm", prv_cut = 0.10,
lib_cut = 0, main_var, adj_formula = NULL,
rand_formula = NULL, lme_control = lme4::lmerControl(),
struc_zero = FALSE, neg_lb = FALSE,
alpha = 0.05, n_cl = 1, verbose = TRUE){
message("'ancom' has been fully evolved to 'ancombc2'. \n",
"Explore the enhanced capabilities of our refined method!")
# 1. Data pre-processing
# Data sanity check
check_results = data_sanity_check(data = data,
taxa_are_rows = taxa_are_rows,
assay.type = assay_name,
assay_name = assay_name,
rank = tax_level,
tax_level = tax_level,
aggregate_data = aggregate_data,
meta_data = meta_data,
fix_formula = paste(main_var, adj_formula, sep = " + "),
group = main_var,
struc_zero = struc_zero,
verbose = verbose)
# Filter data by prevalence and library size
core = .data_core(data = check_results$feature_table_aggregate,
meta_data = check_results$meta_data,
prv_cut = prv_cut, lib_cut = lib_cut,
tax_keep = NULL, samp_keep = NULL)
feature_table = core$feature_table
tax_keep = core$tax_keep
tax_name = rownames(feature_table)
meta_data = core$meta_data
meta_data[] = lapply(meta_data, function(x)
if(is.factor(x)) factor(x) else x)
# Check the type of main variable
main_val = meta_data[, main_var]
main_class = class(main_val)
if (main_class == "character") {
if (length(unique(main_val)) == length(main_val)) {
warn_txt = sprintf(paste("The class of main varible is:",
main_class,
"but it contains too many categories.",
"Perhaps it should be numeric?",
sep = "\n"))
warning(warn_txt, call. = FALSE)
} else if (length(unique(main_val)) > 2) {
main_cat = 1
} else if (length(unique(main_val)) == 2) {
main_cat = 0
} else {
stop_txt = sprintf(paste("The class of main varible is:",
main_class,
"but it contains < 2 categories",
sep = "\n"))
stop(stop_txt, call. = FALSE)
}
} else if (main_class == "factor") {
if (nlevels(main_val) == length(main_val)) {
warn_txt = sprintf(paste("The class of main varible is:",
main_class,
"but it contains too many categories.",
"Perhaps it should be numeric?",
sep = "\n"))
warning(warn_txt, call. = FALSE)
} else if (nlevels(main_val) > 2) {
main_cat = 1
} else if (nlevels(main_val) == 2) {
main_cat = 0
} else {
stop_txt = sprintf(paste("The class of main varible is:",
main_class,
"but it contains < 2 categories",
sep = "\n"))
stop(stop_txt, call. = FALSE)
}
} else {
main_cat = 0
}
# Check the number of taxa
n_tax = nrow(feature_table)
if (n_tax < 10) {
warn_txt = sprintf(paste("ANCOM results would be unreliable when the number of taxa is too small (e.g. < 10)",
"The number of taxa after filtering is: ",
n_tax, sep = "\n"))
warning(warn_txt, call. = FALSE)
}
# 2. Identify taxa with structural zeros
if (verbose) {
message("Identifing taxa with structural zeros ...")
}
if (struc_zero) {
if (! main_class %in% c("character", "factor")) {
stop_txt = sprintf(paste("The main variable should be discrete for detecting structural zeros.",
"Otherwise, set struc_zero = FALSE to proceed"))
stop(stop_txt, call. = FALSE)
}
zero_ind = .get_struc_zero(data = check_results$feature_table_aggregate,
meta_data = check_results$meta_data,
group = main_var, neg_lb = neg_lb)
zero_ind = zero_ind[tax_keep, ]
rownames(zero_ind) = NULL
num_struc_zero = apply(zero_ind[, -1], 1, sum)
comp_idx = which(num_struc_zero == 0)
comp_table = feature_table[comp_idx, ]
}else{
zero_ind = NULL
comp_table = feature_table
}
# Add pseudo-count (1) and take logarithm
comp_table = log(as.matrix(comp_table) + 1)
n_tax = dim(comp_table)[1]
taxon_id = rownames(comp_table)
n_samp = dim(comp_table)[2]
# 3. Determine the type of statistical test and its formula
if (verbose) {
message("Determining the type of statistical test and its formula ...")
}
if (is.null(rand_formula)) {
# Model: linear model
tfun = stats::lm
# Formula
if (is.null(adj_formula)) {
tformula = formula(paste("x ~", main_var), sep = " ")
}else {
tformula = formula(paste("x ~", main_var, "+", adj_formula), sep = " ")
}
}else if (!is.null(rand_formula)) {
# Model: linear mixed-effects model
tfun = lmerTest::lmer
# Formula
if (is.null(adj_formula)) {
# Random intercept model
tformula = formula(paste0("x ~ ", main_var, "+ ", rand_formula))
}else {
# Random coefficients/slope model
tformula = formula(paste0("x ~ ", main_var, "+ ",
adj_formula, "+ ", rand_formula))
}
}
# 4. Computing pairwise p-values and effect sizes
if (verbose) {
message("Computing pairwise p-values and effect sizes ...")
}
comb = function(...) {
mapply("rbind", ..., SIMPLIFY = FALSE)
}
idx1 = NULL
if (n_cl > 1) {
cl = parallel::makeCluster(n_cl)
doParallel::registerDoParallel(cl)
} else {
foreach::registerDoSEQ()
}
if (main_cat == 0) {
result = foreach(idx1 = seq_len(n_tax), .combine = comb, .multicombine = TRUE) %dopar% {
alr_data = apply(comp_table, 1, function(x) x - comp_table[idx1, ])
alr_data = cbind(alr_data, meta_data)
p_vec = rep(NA, n_tax)
beta_vec = rep(NA, n_tax)
idx2 = NULL
if (is.null(rand_formula)) {
for (idx2 in seq_len(n_tax)) {
test_data = data.frame(x = alr_data[, idx2],
meta_data,
check.names = FALSE)
lm_fit = suppressWarnings(tfun(tformula, data = test_data))
# The main variable is on the second row
p_vec[idx2] = summary(lm_fit)$coef[2, "Pr(>|t|)"]
beta_vec[idx2] = summary(lm_fit)$coef[2, "t value"]
}
} else {
for (idx2 in seq_len(n_tax)) {
test_data = data.frame(x = alr_data[, idx2],
meta_data,
check.names = FALSE)
lme_fit = try(tfun(formula = tformula,
data = test_data,
na.action = na.omit,
control = lme_control),
silent = TRUE)
if (inherits(lme_fit, "try-error")) {
p_vec[idx2] = NA
beta_vec[idx2] = NA
} else {
summary_fit = summary(lme_fit)
# The main variable is on the second row
p_vec[idx2] = summary_fit$coefficients[2, "Pr(>|t|)"]
beta_vec[idx2] = summary_fit$coefficients[2, "Estimate"]
}
}
}
list(p_vec, beta_vec)
}
} else {
result = foreach(idx1 = seq_len(n_tax), .combine = comb, .multicombine = TRUE) %dopar% {
alr_data = apply(comp_table, 1, function(x) x - comp_table[idx1, ])
alr_data = cbind(alr_data, meta_data)
p_vec = rep(NA, n_tax)
beta_vec = rep(NA, n_tax)
idx2 = NULL
if (is.null(rand_formula)) {
for (idx2 in seq_len(n_tax)) {
test_data = data.frame(x = alr_data[, idx2],
meta_data,
check.names = FALSE)
lm_fit = suppressWarnings(tfun(tformula, data = test_data))
p_vec[idx2] = anova(lm_fit)[main_var, "Pr(>F)"]
beta_vec[idx2] = anova(lm_fit)[main_var, "F value"]
}
} else {
for (idx2 in seq_len(n_tax)) {
test_data = data.frame(x = alr_data[, idx2],
meta_data,
check.names = FALSE)
lme_fit = try(tfun(formula = tformula,
data = test_data,
na.action = na.omit,
control = lme_control),
silent = TRUE)
if (inherits(lme_fit, "try-error")) {
p_vec[idx2] = NA
beta_vec[idx2] = NA
} else {
p_vec[idx2] = anova(lme_fit)[main_var, "Pr(>F)"]
beta_vec[idx2] = anova(lme_fit)[main_var, "F value"]
}
}
}
list(p_vec, beta_vec)
}
}
if (n_cl > 1) {
parallel::stopCluster(cl)
}
p_data = result[[1]]
beta_data = result[[2]]
colnames(p_data) = taxon_id
rownames(p_data) = taxon_id
colnames(beta_data) = taxon_id
rownames(beta_data) = taxon_id
diag(p_data) = 1
p_data[is.na(p_data)] = 1
diag(beta_data) = 0
beta_data[is.na(beta_data)] = 0
# 5. Primary results
if (verbose) {
message("Primary results ...")
}
q_data = apply(p_data, 2, function(x) p.adjust(x, method = p_adj_method))
W = apply(q_data, 2, function(x) sum(x <= alpha))
res_comp = data.frame(taxon = taxon_id, W, row.names = NULL, check.names = FALSE)
res_comp$detected_0.9 = ifelse(res_comp$W > 0.9 * (n_tax - 1), TRUE, FALSE)
res_comp$detected_0.8 = ifelse(res_comp$W > 0.8 * (n_tax - 1), TRUE, FALSE)
res_comp$detected_0.7 = ifelse(res_comp$W > 0.7 * (n_tax - 1), TRUE, FALSE)
res_comp$detected_0.6 = ifelse(res_comp$W > 0.6 * (n_tax - 1), TRUE, FALSE)
# Combine the information of structural zeros
if (struc_zero){
res = data.frame(taxon = tax_name, W = Inf,
detected_0.9 = TRUE, detected_0.8 = TRUE,
detected_0.7 = TRUE, detected_0.6 = TRUE,
row.names = NULL, check.names = FALSE)
res[comp_idx, ] = res_comp
}else{
res = res_comp
}
out = list(res = res, zero_ind = zero_ind,
beta_data = beta_data, p_data = p_data, q_data = q_data)
return(out)
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.