knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(IFAA)

IFAA offers a robust approach to make inference on the association of covariates with the absolute abundance (AA) of microbiome in an ecosystem. It can be also directly applied to relative abundance (RA) data to make inference on AA because the ratio of two RA is equal ratio of their AA. This algorithm can estimate and test the associations of interest while adjusting for potential confounders. High-dimensional covariates are handled with regularization. The estimates of this method have easy interpretation like a typical regression analysis. High-dimensional covariates are handled with regularization and it is implemented by parallel computing. False discovery rate is automatically controlled by this approach. Zeros do not need to be imputed by a positive value for the analysis. The IFAA package also offers the 'MZILN' function for estimating and testing associations of abundance ratios with covariates.

To model the association, the following equation is used: $$ \log(\mathcal{Y}_i^k)|\mathcal{Y}_i^k>0=\beta^{0k}+X_i^T\beta^k+W_i^T\gamma^k+Z_i^Tb_i+\epsilon_i^k,\hspace{0.2cm}k=1,...,K+1, $$ where

The challenge in microbiome analysis is that we can not oberve $\mathcal{Y}_i^k$. What is observed is its small proportion: $Y_i^k=C_i\mathcal{Y}^k_i$ where $C_i$ is an unknown number between 0 and 1 that denote the observed proportion. The IFAA method successfuly addressed this challenge.

Package installation

To install, follow the instruction on Bioconductor. The package could be also installed from GitHub using the following code:

require(devtools)
devtools::install_github("gitlzg/IFAA")

Input for IFAA() function

Most of the time, users just need to feed these three inputs to the function: experiment_dat, testCov and ctrlCov. All other inputs can just take their default values. Below are all the inputs of the functions

Output for IFAA() function

A list containing 2 elements

Example

The example generates an example data from scratch, with 10 taxon, 40 subjects, and 3 covariates.

library(IFAA)

set.seed(1)

## create an ID variable for the example data
ID <- seq_len(20)

## generate three covariates x1, x2, and x3, with x2 binary
x1 <- rnorm(20)
x2 <- rbinom(20, 1, 0.5)
x3 <- rnorm(20)
dataC <- data.frame(cbind(ID, x1, x2, x3))

## Coefficients for x1, x2, and x3 among 10 taxa.
beta_1 <- c(0.1, rep(0, 9))
beta_2 <- c(0, 0.2, rep(0, 8))
beta_3 <- rnorm(10)
beta_mat <- cbind(beta_1, beta_2, beta_3)

## Generate absolute abundance for 10 taxa in ecosystem.
dataM_eco <- floor(exp(10 + as.matrix(dataC[, -1]) %*% t(beta_mat) + rnorm(200, sd = 0.05)))

## Generate sequence depth and generate observed abundance
Ci <- runif(20, 0.01, 0.05)
dataM <- floor(apply(dataM_eco, 2, function(x) x * Ci))
colnames(dataM) <- paste0("rawCount", 1:10)

## Randomly introduce 0 to make 25% sparsity level.
dataM[sample(seq_len(length(dataM)), length(dataM) / 4)] <- 0

dataM <- data.frame(cbind(ID, dataM))

## The following steps are to create a SummarizedExperiment object.
## If you already have a SummarizedExperiment format data, you can
## ignore the following steps and directly feed it to the IFAA function.

## Merge two dataset by ID variable
data_merged <- merge(dataM, dataC, by = "ID", all = FALSE)

## Seperate microbiome data and covariate data, drop ID variable from microbiome data
dataM_sub <- data_merged[, colnames(dataM)[!colnames(dataM) %in% c("ID")]]
dataC_sub <- data_merged[, colnames(dataC)]

## Create SummarizedExperiment object
test_dat <- SummarizedExperiment::SummarizedExperiment(
  assays = list(MicrobData = t(dataM_sub)), colData = dataC_sub
)

If you already have a SummarizedExperiment format data, you can ignore the above steps for creating a SummarizedExperiment object. In the generated data, rawCount1 is associated with x1, rawCount2 is associated with x2, and the sparsity level is 25%. Next we analyze the data to test the association between microbiome and the variable "x1", "x2" while adjusting for the variable (potential confounder) "x3".

set.seed(123) # For full reproducibility
results <- IFAA(
  experiment_dat = test_dat,
  testCov = c("x1", "x2"),
  ctrlCov = c("x3"),
  sampleIDname = "ID",
  fdrRate = 0.05,
  nRef = 2,
  paraJobs = 2
)

Notice that this example used "nRef = 2" and "paraJobs = 2" for the sake of speed in this example. Typically one should use the default values for those two arguments.

In this example, we are only interested in testing the associations with "x1" and "x2" which is why testCov=c("x1","x2). The variables "x3" are adjusted as potential confounders in the analyses. The final analysis results are saved in the list full_result and the significant results can be extracted as follows:

summary_res <- results$full_results
sig_results <- subset(summary_res, sig_ind == TRUE)
sig_results

After adjusting for multiple testing, the results found "rawCount1" associated with "x1", and "rawCount2" associated with "x2", while adjusting for "x3". The regression coefficients and their 95% confidence intervals are provided. These coefficients correspond to $\beta^k$ in the model equation.

The interpretation is:

If only interested in certain taxa, say "rawCount1", "rawCount2", and "rawCount3", one can do:

set.seed(123) # For full reproducibility

results <- IFAA(
  experiment_dat = test_dat,
  microbVar = c("rawCount1", "rawCount2", "rawCount3"),
  testCov = c("x1", "x2"),
  ctrlCov = c("x3"),
  sampleIDname = "ID",
  fdrRate = 0.05,
  nRef = 2,
  paraJobs = 2
)

Again, notice that this example used "nRef = 2" and "paraJobs = 2" for the sake of speed in this example. Typically one should use the default values for those two arguments.

Reference

Li et al.(2021) IFAA: Robust association identification and Inference For Absolute Abundance in microbiome analyses. Journal of the American Statistical Association. 116(536):1595-1608

MZILN() function

The IFAA package can also implement the Multivariate Zero-Inflated Logistic Normal (MZILN) regression model for estimating and testing the association of abundance ratios with covariates. The MZILN() function estimates and tests the associations of user-specified abundance ratios with covariates. When the denominator taxon of the ratio is independent of the covariates, 'MZILN()' should generate similar results as 'IFAA()'. The regression model of 'MZILN()' can be expressed as follows: $$ \log\bigg(\frac{\mathcal{Y}_i^k}{\mathcal{Y}_i^{K+1}}\bigg)|\mathcal{Y}_i^k>0,\mathcal{Y}_i^{K+1}>0=\alpha^{0k}+\mathcal{X}_i^T\alpha^k+\epsilon_i^k,\hspace{0.2cm}k=1,...,K, $$ where

Input for MZILN() function

Most of the time, users just feed the first four inputs to the function: experiment_dat, microbVar, refTaxa and allCov. All other inputs can just take their default values. All the inputs for 'MZILN()' are:

Output for MZILN() function

A list with two elements:

Examples

The example used data generated from IFAA example, with 10 taxon, 40 subjects, and 3 covariates.

Next we analyze the data to test the associations between the ratios "rawCount5/rawCount10" and "rawCount8/rawCount10" and all the three variables "x1", "x2" and "x3" in a multivariate model where all "x1", "x2" and "x3" are independent variables simultaneously.

set.seed(123) # For full reproducibility

resultsRatio <- MZILN(
  experiment_dat = test_dat,
  microbVar = c("rawCount5","rawCount8"),
  refTaxa = c("rawCount10"),
  allCov = c("x1", "x2", "x3"),
  sampleIDname = "ID",
  fdrRate = 0.05,
  paraJobs = 2
)

Again, notice that this example used "nRef = 2" and "paraJobs = 2" for the sake of speed in this example. Typically one should use the default values for those two arguments.

The final analysis results can be extracted as follows:

summary_res_ratio <- resultsRatio$full_results
summary_res_ratio

The regression coefficients and their 95% confidence intervals are provided. These coefficients correspond to $\alpha^k$ in the model equation.

The results show that the ratio "rawCount5/rawCount10" is significantly associated with "x2" and "x3" and the ratio "rawCount8/rawCount10" is significantly associated with "x3" after multiple testing adjustment. The interpretation is:

If the purpose is to explore all taxa ratios with "rawCount10" being the denominator, one can do:

set.seed(123) # For full reproducibility
resultsAllRatio <- MZILN(
  experiment_dat = test_dat,
  refTaxa = c("rawCount10"),
  allCov = c("x1", "x2", "x3"),
  sampleIDname = "ID",
  fdrRate = 0.05,
  paraJobs = 2
)

Again, notice that this example used "nRef = 2" and "paraJobs = 2" for the sake of speed in this example. Typically one should use the default values for those two arguments.

The results for the ratio "rawCount5/rawCount10" can extracted as follows:

summary_res_ratio <- resultsAllRatio$full_results
summary_res_ratio[summary_res_ratio$taxon == "rawCount5", , drop = FALSE]

The interpretation of the association estimates remain the same. Notice that the significance dropped for the association between the ratio "rawCount5/rawCount10" and "x2" due to the correction for multiple testing (i.e., 0.2025>0.05). It is because this exploratory analysis included all microbiome variables and thus the p-value adjustment is more stringent for multiple testing which led to the larger adjusted p-value.

The results for the ratio "rawCount8/rawCount10" can be extracted similarly.

To extract all significant ratios with "rawCount10" being the denominator,

one can do: subset(summary_res_ratio,sig_ind==TRUE).

Reference

Li et al.(2018) Conditional Regression Based on a Multivariate Zero-Inflated Logistic-Normal Model for Microbiome Relative Abundance Data. Statistics in Biosciences 10(3): 587-608

Session Info

sessionInfo()


gitlzg/IFAA documentation built on Jan. 26, 2024, 10:26 p.m.