---
title: "Simulating realistic microbial observations with SparseDOSSA2"
author:
- name: "Siyuan Ma"
  affiliation:
  - Harvard T.H. Chan School of Public Health
  - Broad Institute
  email: syma.research@gmail.com
package: SparseDOSSA2
date: "12/01/2020"
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{SparseDOSSA2}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: references.bib
---
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(cache = FALSE)
`SparseDOSSA2` is an R package for fitting to and simulating realistic microbial abundance observations. It provides functionalities for: a) generating realistic synthetic microbial observations, b) spiking in associations with metadata variables, e.g. for benchmarking or power analysis, and c) fitting the SparseDOSSA 2 model to real-world microbial abundance observations, which can then be used for a). This vignette provides working examples of these functionalities.
library(SparseDOSSA2)
# tidyverse packages for utilities
library(magrittr)
library(dplyr)
library(ggplot2)
SparseDOSSA2 is a Bioconductor package and can be installed via the following command.
# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install("SparseDOSSA2")
# Simulating realistic microbial observations with `SparseDOSSA2`
The most important functionality of `SparseDOSSA2` is the simulation of realistic synthetic microbial observations. To this end, `SparseDOSSA2` provides three pre-trained templates, `"Stool"`, `"Vaginal"`, and `"IBD"`, targeting continuous, discrete, and diseased population structures, respectively.
Stool_simulation <- SparseDOSSA2(template = "Stool", n_sample = 100, n_feature = 100, verbose = TRUE)
Vaginal_simulation <- SparseDOSSA2(template = "Vaginal", n_sample = 100, n_feature = 100, verbose = TRUE)
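After a simulation has run, a quick sanity check on the resulting count table is often useful. The sketch below uses only base R and a small hand-made feature-by-sample matrix; `toy_counts` and `summarize_samples()` are hypothetical names introduced here for illustration. The same helper could presumably be applied to the simulated count table in the returned object (assumed here to be stored as `Stool_simulation$simulated_data`).

```r
# Hypothetical helper: per-sample read depth and fraction of zero entries
# for a feature-by-sample count matrix (rows = features, columns = samples).
summarize_samples <- function(counts) {
  stopifnot(is.matrix(counts), is.numeric(counts), all(counts >= 0))
  data.frame(
    sample = colnames(counts),
    read_depth = unname(colSums(counts)),
    zero_fraction = unname(colMeans(counts == 0))
  )
}

# Toy data standing in for a simulated count table (not real output)
toy_counts <- matrix(
  c(10, 0,  5,
     0, 0, 20,
     3, 7,  0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("Feature", 1:3), paste0("Sample", 1:3))
)

sample_summary <- summarize_samples(toy_counts)
print(sample_summary)
```

Large differences in read depth or an unexpectedly high zero fraction are worth checking before using simulated data for benchmarking.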
`SparseDOSSA2` provides two functions, `fit_SparseDOSSA2` and `fitCV_SparseDOSSA2`, to fit the SparseDOSSA2 model to microbial count or relative abundance observations. As input, both functions require a feature-by-sample table of microbial abundance observations. The package ships with a minimal example of such a dataset: a five-by-five subset of the HMP1-II stool study.
data("Stool_subset", package = "SparseDOSSA2")
# columns are samples.
Stool_subset[1:2, 1, drop = FALSE]
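Because the fitting functions accept either counts or relative abundances in a feature-by-sample layout, it can help to confirm what kind of table is about to be passed in. The following is a minimal base R sketch under stated assumptions: `describe_abundance_table()` and the toy matrices are hypothetical, and the count/relative-abundance distinction is a simple heuristic, not the package's own input validation.

```r
# Hypothetical helper: report dimensions and a heuristic guess of the data type
# for a feature-by-sample table (rows = features, columns = samples).
describe_abundance_table <- function(x) {
  x <- as.matrix(x)
  stopifnot(is.numeric(x), !anyNA(x), all(x >= 0))
  # Heuristic: all-integer values are treated as counts
  is_count <- all(x == round(x))
  # Heuristic: relative abundances sum to at most one within each sample
  is_relative <- !is_count && all(colSums(x) <= 1 + 1e-8)
  list(
    n_feature = nrow(x),
    n_sample = ncol(x),
    type = if (is_count) "count" else if (is_relative) "relative abundance" else "unknown"
  )
}

toy_counts <- matrix(
  c(4, 0, 6, 1, 9, 0), nrow = 3,
  dimnames = list(paste0("Feature", 1:3), paste0("Sample", 1:2))
)
# Total-sum scaling: divide each sample (column) by its total
toy_relab <- sweep(toy_counts, 2, colSums(toy_counts), "/")

count_info <- describe_abundance_table(toy_counts)
relab_info <- describe_abundance_table(toy_relab)
print(count_info$type)
print(relab_info$type)
```

If samples turn out to be in rows rather than columns, transposing with `t()` before fitting restores the expected orientation.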
# Fitting the model with `fit_SparseDOSSA2`
`fit_SparseDOSSA2` fits the SparseDOSSA2 model to estimate the model parameters: per-feature prevalence, mean and standard deviation of non-zero abundances, and feature-feature correlations. It also estimates the joint distribution of these parameters and (if the input is count data) a read count distribution.
fitted <- fit_SparseDOSSA2(data = Stool_subset, control = list(verbose = TRUE))
# fitted mean log non-zero abundance values of the first two features
fitted$EM_fit$fit$mu[1:2]
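To make the per-feature parameters concrete, the sketch below computes their plain empirical counterparts on a toy relative abundance matrix using base R. These are simple sample statistics, not the model-based estimates that `fit_SparseDOSSA2` returns; `feature_summaries()` and `toy_relab` are hypothetical names used only for illustration.

```r
# Empirical per-feature summaries of a feature-by-sample abundance matrix.
# These are descriptive statistics only, not the package's fitted parameters.
feature_summaries <- function(x) {
  x <- as.matrix(x)
  t(apply(x, 1, function(row) {
    nonzero <- row[row > 0]
    c(
      prevalence = mean(row > 0),
      mean_log_nonzero = if (length(nonzero) > 0) mean(log(nonzero)) else NA_real_,
      sd_log_nonzero = if (length(nonzero) > 1) sd(log(nonzero)) else NA_real_
    )
  }))
}

# Toy relative abundances: rows are features, columns are samples
toy_relab <- matrix(
  c(0.50, 0.00, 0.25,
    0.50, 1.00, 0.75),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("FeatureA", "FeatureB"), paste0("Sample", 1:3))
)

toy_summary <- feature_summaries(toy_relab)
print(round(toy_summary, 3))
```

Comparing such empirical summaries between the input data and simulated data is one informal way to judge how well a fitted model reproduces the original features.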
# Fitting the model with `fitCV_SparseDOSSA2`
Users can additionally tune the model fit via `fitCV_SparseDOSSA2`. They can either provide a vector of tuning parameter values (`lambdas`) that control sparsity in the estimated correlation matrix, or let a grid be selected automatically. `fitCV_SparseDOSSA2` uses cross-validation to select an "optimal" model fit across these tuning parameters, based on the average testing log-likelihood. This procedure is computationally intensive and is best suited to users who want an accurate fit to the input dataset, so that new microbial observations can be simulated on the same features as the input (i.e. not new features).
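The selection principle described above, choosing a tuning parameter by its average held-out log-likelihood, can be sketched in base R on toy data. This is an illustration of the general idea only: it swaps in a simple covariance shrinkage toward the diagonal with a Gaussian likelihood, which is not the penalized estimator used by `fitCV_SparseDOSSA2`. All names below (`shrink_cov()`, `gaussian_loglik()`, `loglik_cv`) are hypothetical.

```r
set.seed(1)

# Toy data: 60 samples (rows) of 3 Gaussian variables, two of them correlated
n <- 60
p <- 3
z <- matrix(rnorm(n * p), n, p)
x <- z %*% chol(matrix(c(1, 0.6, 0, 0.6, 1, 0, 0, 0, 1), p, p))

# Shrink the sample covariance toward its diagonal; lambda lies in [0, 1]
shrink_cov <- function(x, lambda) {
  s <- cov(x)
  (1 - lambda) * s + lambda * diag(diag(s))
}

# Average multivariate Gaussian log-likelihood per held-out sample
gaussian_loglik <- function(x, mu, sigma) {
  centered <- sweep(x, 2, mu)
  quad <- rowSums((centered %*% solve(sigma)) * centered)
  log_det <- as.numeric(determinant(sigma, logarithm = TRUE)$modulus)
  mean(-0.5 * (ncol(x) * log(2 * pi) + log_det + quad))
}

lambdas <- c(0, 0.5, 1)
K <- 5
folds <- sample(rep(seq_len(K), length.out = n))

# Rows index folds, columns index tuning parameter values
loglik_cv <- sapply(lambdas, function(lambda) {
  sapply(seq_len(K), function(k) {
    train <- x[folds != k, , drop = FALSE]
    test <- x[folds == k, , drop = FALSE]
    gaussian_loglik(test, colMeans(train), shrink_cov(train, lambda))
  })
})

# Pick the lambda with the highest average testing log-likelihood
mean_loglik <- colMeans(loglik_cv)
best_lambda <- lambdas[which.max(mean_loglik)]
print(mean_loglik)
print(best_lambda)
```

Each candidate value requires `K` separate fits, which is why this kind of procedure becomes computationally intensive on real microbial datasets.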
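# Parallelization controls with `future`

`SparseDOSSA2` internally uses `r BiocStyle::CRANpkg("future")` to allow for parallel computation. Parallelization can therefore be specified through `future`'s interface; see the `future` documentation for more details. This is particularly useful when running `SparseDOSSA2` in a high-performance computing environment.

# Sessioninfo

```r
sessionInfo()
```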