knitr::opts_chunk$set( collapse = TRUE, comment = "#>", crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html )
## Track time spent on making the vignette startTime <- Sys.time() ## Bib setup library("RefManageR") ## Write bibliography information bib <- c( R = citation(), BiocStyle = citation("BiocStyle")[1], knitr = citation("knitr")[1], RefManageR = citation("RefManageR")[1], rmarkdown = citation("rmarkdown")[1], sessioninfo = citation("sessioninfo")[1], testthat = citation("testthat")[1], phyloseq = citation("phyloseq")[1], treesummarizedexperiment = citation("TreeSummarizedExperiment")[1], tidyverse = citation("tidyverse")[1], mixtools = citation("mixtools")[1], fitdistrplus = citation("fitdistrplus")[1], CBEA = citation("CBEA")[1], broom = citation("broom")[1], BiocParallel = citation("BiocParallel")[1] )
This package implements the CBEA approach for performing set-based enrichment analysis for microbiome relative abundance data. A preprint of the package can be found on bioXriv. In summary, CBEA (Competitive Balances for taxonomic Enrichment Analysis) provides an estimate of the activity of a set by transforming an input taxa-by-sample data matrix into a corresponding set-by-sample data matrix. The resulting output can be used for additional downstream analyses such as differential abundance, classification, clustering, etc. using set-based features instead of the original units.
The transformation that CBEA applies is based on the isometric log ratio transformation:
$$
CBEA_{i,\mathbb{S}} = \sqrt{\frac{|\mathbb{S}||\mathbb{S_c}|}{|\mathbb{S}| + |\mathbb{S_c}|}} \ln \frac{g(X_{i,j | j\in \mathbb{S}})}{g(X_{i,j | j \notin \mathbb{S}})}
$$
Where $\mathbb{S}$ is the set of interest, $\mathbb{S}_C$ is it's complement, $g()$ is the geometric mean operation, and $X$ is the original data matrix where $i$ is the index representing samples and $j$ is the index representing variables (or taxa).
The inference procedure is performed through estimating the null distribution of the test statistic. This can be done either via permutations or a parametric fit of a distributional form on the permuted scores. Users can also adjust for variance inflation due to inter-taxa correlation. Please refer to the main manuscript for any additional details.
CBEA
r Biocpkg("CBEA")
is an R
package available via the Bioconductor repository for packages. It requires installing the R
open source statistical programming language, which can be accessed on any operating system from CRAN. After which you can install r Biocpkg("CBEA")
by using the following commands in your R
session:
if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("CBEA") ## Check that you have a valid Bioconductor installation BiocManager::valid()
If there are any issues with the installation procedure or package features, the best place would be to file an issue at the GitHub repository. For any additional support you can use the Bioconductor support site and use the CBEA
tag and check the older posts. Please note that if you want to receive help you should adhere to the posting guidelines. It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error.
library("CBEA") library(BiocSet) library(tidyverse) set.seed(1020)
First, we can load some pre-packaged data sets contained in r Biocpkg("CBEA")
. Here we're loading the data from the Human Microbiome Project (HMP) in TreeSummarizedExperiment
data container from the r Biocpkg("TreeSummarizedExperiment")
. This package does not support phyloseq
from the r Biocpkg("phyloseq")
package but users can leverage the r Biocpkg("mia")
package to convert between the data types.
In addition, users can also input raw matrices or data frames, however those require additional arguments. The taxa_are_rows
argument requires users specify whether the data frame/matrix has taxa abundances as rows (or as columns). The id_col
argument requires users to specify (for data frames only) a vector of names of row metadata that will be excluded from the analysis.
data("hmp_gingival") abun <- hmp_gingival$data metab_sets <- hmp_gingival$set abun # this is a TreeSummarizedExperiment object
CBEA accepts any type of sets, as long as it is in the BiocSet
format where the elements in the sets can be matched to taxa names in the data set. The main function will check if these names match.
metab_sets
For more information on BiocSet
, please refer to the documentation from r Biocpkg("BiocSet")
. However, simply speaking, BiocSet
acts similar to a list of three data frames and can be used in conjunction with r CRANpkg("dplyr")
/r CRANpkg("tidyr")
.
After specifying the inputs, cbea
is the main function to apply the method. If there are zeros in the abundance data, the cbea
will add a pseudocount to avoid issues with the log-ratio transformation (but will throw a warning). If a different zero-handling approach is desired, users should pre-process the abundance data with the appropriate method. For parametric fits, cbea
relies on the r CRANpkg("fitdistrplus")
and r CRANpkg("mixtools")
packages to estimate the parameters of the null. Specific arguments to control this fitting procedure can be provided as a named list in the control
argument.
Applying cbea
is one command:
results <- cbea(abun, set = metab_sets, abund_values = "16SrRNA", output = "cdf", distr = "mnorm", adj = TRUE, thresh = 0.05, n_perm = 10) results
Some important arguments to control the behaviour of CBEA.
output
: This controls what type of output is being returned. CBEA usually estimates a parametric null and users can specify what they want in return. If users want to perform downstream analysis with set-level features, they can return CDF values or z-scores of each raw score computed against that distribution (options cdf
or zscore
). Alternatively, users can just return the raw scores themselves (no distribution fitting will be performed) using raw
as the option. Users can also use this distribution to estimate unadjusted p-values (option pval
) to see whether a set is enriched at each sample. These unadjusted p-values can be converted based on a threshold (based on thresh
which is default to be set at 0.05) into a dummy variable indicating enrichment (option sig
). Note: CDF values and Z-scores are not available for non-parametric null estimations. parametric
: This is a logical argument to specify whether a the null distribution will be specified via parametric fit or via non-parametric permutation testing. If parametric
is TRUE
, users need to specify distr
and adj
. If parametric
is FALSE
, users need to increase n_perm
. distr
: The form of the distribution if parametric fit is desired. As of now only supports norm
, mnorm
. adj
: Whether the distribution should be adjusted for variance inflation. This procedure is done by combining the mean estimate from scores computed from permuted data set and the variance estimate from raw scores (computed on the unpermuted data set). The output object is of class CBEAout
, which is an S3 object. The underlying data structure is a list of lists, where the outer lists represent different aspects of the output. For example R
represent the final scores while diagnostic
represent certain goodness-of-fit statistics.
names(results)
Within each aspect, there is a list of size equivalent to the total number of sets evaluated. For example, the results
object is of size 3 representing the evaluated sets.
str(results$R)
Users can use tidy
and glance
following the r CRANpkg("broom")
to process CBEAout
into nice objects. The tidy
function returns a tibble
of scores (samples by set). The glance
function returns some diagnostics. There are two options for the glance
function: fit_comparison
allows users to compare the l-moments of the data, the permuted data, and the final fitted distribution; fit_diagnostic
shows goodness-of-fit statistics of the distribution fitting procedure itself, with log-likelihoods and Anderson-Darling (column "ad") statistics.
tidy(results) glance(results, "fit_comparison") glance(results, "fit_diagnostic")
CBEA
has in-built capacity to perform calculations paralelled across the total number of sets. The engine for parallelization is r Biocpkg("BiocParallel")
. If NULL
, SerialParam
backend will be used.
BiocParallel::registered()
cbea(abun, set = metab_sets, abund_values = "16SrRNA", output = "cdf", distr = "mnorm", adj = TRUE, thresh = 0.05, n_perm = 10, parallel_backend = MulticoreParam(workers = 2))
CBEA
We hope that r Biocpkg("CBEA")
will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you!
## Citation info citation("CBEA")
The r Biocpkg("CBEA")
package r Citep(bib[["CBEA"]])
was made possible thanks to:
r Citep(bib[["R"]])
r Biocpkg("BiocStyle")
r Citep(bib[["BiocStyle"]])
r CRANpkg("knitr")
r Citep(bib[["knitr"]])
r CRANpkg("RefManageR")
r Citep(bib[["RefManageR"]])
r CRANpkg("rmarkdown")
r Citep(bib[["rmarkdown"]])
r CRANpkg("broom")
r Citep(bib[["broom"]])
r CRANpkg("sessioninfo")
r Citep(bib[["sessioninfo"]])
r CRANpkg("testthat")
r Citep(bib[["testthat"]])
r CRANpkg("mixtools")
r Citep(bib[["mixtools"]])
r CRANpkg("fitdistrplus")
r Citep(bib[["fitdistrplus"]])
r CRANpkg("tidyverse")
r Citep(bib[["tidyverse"]])
r Biocpkg("BiocSet")
r Citep(bib[["BiocSet"]])
r Biocpkg("phyloseq")
r Citep(bib[["phyloseq"]])
r Biocpkg("BiocParallel")
r Citep(bib[["BiocParallel"]])
This package was developed using r BiocStyle::Biocpkg("biocthis")
.
Code for creating the vignette
## Create the vignette library("rmarkdown") system.time(render("basic_usage.Rmd", "BiocStyle::html_document")) ## Extract the R code library("knitr") knit("basic_usage.Rmd", tangle = TRUE)
Date the vignette was generated.
## Date the vignette was generated Sys.time()
Wallclock time spent generating the vignette.
## Processing time in seconds totalTime <- diff(c(startTime, Sys.time())) round(totalTime, digits = 3)
R
session information.
## Session info library("sessioninfo") options(width = 120) session_info()
This vignette was generated using r Biocpkg("BiocStyle")
r Citep(bib[["BiocStyle"]])
with r CRANpkg("knitr")
r Citep(bib[["knitr"]])
and r CRANpkg("rmarkdown")
r Citep(bib[["rmarkdown"]])
running behind the scenes.
Citations made with r CRANpkg("RefManageR")
r Citep(bib[["RefManageR"]])
.
## Print bibliography PrintBibliography(bib, .opts = list(hyperlink = "to.doc", style = "html"))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.