Multiple strategies such as feature removal or modification exist to handle zero values from Omics data, whether for count-based such as RNA-seq or zero values from mass spectrometry or array-based data. The following literature is recommended to familiarize with zero handling in the analysis of Omics experiments during data analysis: Tarazona. Nat Comm, 2020; Quinn. GigaScience, 2019; Silverman. eLife, 2017. As a reminder, zero handling should be individually considered by the end-users during the pre-processing steps of their respective Omics data collection before using OBIF. As a reminder, the initial input for OBIF an analysis-ready data matrix m without negative or zero values which allows.
OBIF was deliberately developed using a threshold approach where feature removal was applied for zero handling of zero-containing features in a dataset, if needed. This thresholding allows uniform handling of all matrix-ready datasets once they are in the pipeline by the normD function, regardless of platform. Within the package, the normD and obint functions allow users to evaluate the impact of each transformation step on data distribution and interaction analysis when modeling their data with OBIF before committing to any zero handling strategy during preferred in their pre-processing steps, or during data scaling within OBIF. Additionally, to address any zero values that escape user detection in pre-processing, our package already includes a thresh function to remove features containing zero values across all samples for raw counts, RPKM, FPKM or TPM. (This function also allows end users to modify the threshold value.)
To guide the user in the analysis of zero-containing datasets, we will demonstrate below the analytic comparison of a microbiome dataset containing zero counts using negative binominal strategies without zero handling using DESeq2 and the thresholded data scaling strategy using OBIF.
Loading packages
library(DESeq2) library(OBIF) library(ClassComparisson)
Create working datasets before and after zero-handling In this example, we imported Malik et al, 2020’s microbiome sequence data that contains zero counts.
data.orig <- read_excel("~/Documents/R/MA GSE148618 Microbiome/MA GSE148618 Microbiome (Reduced x Shrub).xlsx") data <- fExpr(Data = data.orig, colnum1 = 5, colnum2 = 20, n.ctrl = 4, n.facA = 4, n.facB = 4, n.facAB = 4) meta.data <- metaD(Data = data.orig, colnum1 = 5, colnum2 = 20) feat.IDs <- ftIDs(Data = data.orig, colnum = 4, method = 2)
factor.A <- "Reduced" factor.B <- "Shrub" factor.AB <- "Reduced + Shrub"
row.names(data) <- feat.IDs colnames(data) <- coldata
si.orig <- data.frame(Factors=sapply(colnames(data),function(x)substr(x,1,regexpr('\.',x)[1]-1))) si.orig$Factors <- factor(si.orig$Factors, levels=c('ctrl','facA','facB','facAB')) si <- si.orig design <- model.matrix(~1+si$Factors) colnames(design) <- levels(si$Factors)
samples.col <- data.frame( "samples" = coldata ) col.data <- cbind(samples.col,si2x2, si) dds.AB <- DESeqDataSetFromMatrix(countData = data, colData = col.data, design= ~ 1 + Factors) dds.AB <- DESeq(dds.AB) resultsNames(dds.AB) # lists the coefficients res.AB <- results(dds.AB)
res.AB.true <- results(dds.AB, contrast = c("Factors", "ctrl", "facAB")) # results for DEMs for factor AB res.AB.inter <- results(dds.AB, contrast = c(0,-1,-1,1)) # results for the interaction effect at feature level
pval.des2.AB <- res.AB.true@listData[["pvalue"]] # p-values of expression analysis for factor AB bum.pval.des2.AB <- Bum(pval.des2.AB) # performance evaluation of expression pval.des2.AB.inter <- res.AB.inter@listData[["pvalue"]] # p-values of contrast for interaction effect bum.pval.des2.AB.inter <- Bum(pval.des2.AB.inter) # performance evaluation from contrast analysis
data.m <- tresh(data, 1) # Thresholding: Feature removal of zero values
obif.VP <- normD(data.m, "VP", col.samples) obif.DP <- normD(data.m, "DP", col.samples) obif.BP <- normD(data.m, "BP", col.samples)
data.m.0 <- dataT(data.m, 0) data.m.2 <- dataT(data.m, 2) data.m.4 <- dataT(data.m, 4) data.m.6 <- dataT(data.m, 6) data.m.7 <- dataT(data.m, 7)
obint(dataAR = data.m.0, design2x2 = design2x2, result = "modelsum") obint(dataAR = data.m.2, design2x2 = design2x2, result = "modelsum") obint(dataAR = data.m.6, design2x2 = design2x2, result = "modelsum")
design <- nTool(param = "design", n.ctrl = 4, n.facA = 4, n.facB = 4, n.facAB = 4)
contrast <- nTool(param = "contrast", n.ctrl = 4, n.facA = 4, n.facB = 4, n.facAB = 4)
si <- nTool(param = "si", n.ctrl = 4, n.facA = 4, n.facB = 4, n.facAB = 4)
obffa.res <- obffa(dataAR = data.m.2, design = design, contrast = contrast)
pval.obif.AB.data <- obffa.res$pval.exp.facAB # p-values of expression analysis for factor AB bum.pval.obif.AB <- Bum(pval.obif.AB.data) # performance evaluation of expression pval.obif.AB.inter <- obffa.res$pval.exp.facAB # p-values of contrast for interaction effect bum.pval.obif.AB.interr <- Bum(pval.obif.AB.inter) # performance evaluation from contrast analysis
f <- summary(bum.pval.obif.AB, tau = 0.05) # Substitute the “bum.pval.”- element accordingly f.e <- f@estimates FDP.f.e <- f.e$FP/(f.e$TP+f.e$FP) prec.f.e <- f.e$TP/(f.e$TP+f.e$FP) rec.f.e <- f.e$TP/(f.e$TP+f.e$FN)
image(bum.pval.des2.AB) # Negative binomial strategies with DESeq2 before zero handling ROC: 0.9225 False discovery proportion: 0.1431699 Precision: 0.8568301 Recall: 0.6837901
image(bum.pval.obif.AB) # Tresholded data scaling strategy with OBIF after zero handling ROC: 0.9256 (≈) False discovery proportion: 0.01589925 (↓↓) Precision: 0.9841008 (↑↑) Recall: 0.6955667 (↑)
image(bum.pval.des2.AB) # Negative binomial strategies with DESeq2 before zero handling ROC: 0.8902 False discovery proportion: 0.5565683 Precision: 0.4434317 Recall: 0.5657556
image(bum.pval.obif.AB) # Tresholded data scaling strategy with OBIF after zero handling ROC: 0.8457 (≈) False discovery proportion: 0.09386081 (↓↓) Precision: 0.9061392 (↑↑) Recall: 0.4203641 (↓)
Before zero handling (Original) After zero handling (OBIF)
With OBIF, we detected improved detection of interaction effect (Pr|t| ↓), model fitness (Adj. R2 ↑↑) and F-statistic (p-value ↓↓). Reanalysis of other datasets from Malik’s datasets performed similar in all evaluations.
knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(OBIF)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.