This document shows typical Topconfects usage with limma, edgeR, or DESeq2.
The first step is to load a dataset. Here, we're looking at RNA-seq data that
investigates the response of Arabodopsis thaliana to a bacterial pathogen.
Besides the experimental and control conditions, there is also a batch effect.
This dataset is also examined in section 4.2 of the edgeR
user manual, and
I've followed the initial filtering steps in the edgeR
manual.
knitr::opts_chunk$set(fig.width=7, fig.height=6)
library(topconfects) library(NBPSeq) library(edgeR) library(limma) library(dplyr) library(ggplot2) data(arab) # Retrieve symbol for each gene info <- AnnotationDbi::select( org.At.tair.db::org.At.tair.db, rownames(arab), "SYMBOL") %>% group_by(TAIR) %>% summarize( SYMBOL=paste(unique(na.omit(SYMBOL)),collapse="/")) arab_info <- info[match(rownames(arab),info$TAIR),] %>% select(-TAIR) %>% as.data.frame rownames(arab_info) <- rownames(arab) # Extract experimental design from sample names Treat <- factor(substring(colnames(arab),1,4)) %>% relevel(ref="mock") Time <- factor(substring(colnames(arab),5,5)) y <- DGEList(arab, genes=as.data.frame(arab_info)) # Keep genes with at least 3 samples having an RPM of more than 2 keep <- rowSums(cpm(y)>2) >= 3 y <- y[keep,,keep.lib.sizes=FALSE] y <- calcNormFactors(y)
design <- model.matrix(~Time+Treat) design[,] fit <- voom(y, design) %>% lmFit(design)
Find largest fold changes that we can be confident of at FDR 0.05.
confects <- limma_confects(fit, coef="Treathrcc", fdr=0.05) confects
Here the usual logFC
values estimated by limma
are shown as dots, with lines
to the confect
value.
confects_plot(confects)
confects_plot_me
overlays the confects (black) on a Mean-Difference Plot
(grey) (as might be produced by limma::plotMD
). As we should expect, the very
noisy differences with low mean expression are removed if we look at the
confects.
confects_plot_me(confects)
Let's compare this to the ranking we obtain from topTable
.
fit_eb <- eBayes(fit) top <- topTable(fit_eb, coef="Treathrcc", n=Inf) rank_rank_plot(confects$table$name, rownames(top), "limma_confects", "topTable")
You can see that the top 19 genes from topTable are all within the top 40 for topconfects ranking, but topconfects has also highly ranked some other genes. These have a large effect size, and sufficient if not overwhelming evidence of this.
An MD-plot highlighting the positions of the top 40 genes in both rankings also illustrates the differences between these two ways of ranking genes.
plotMD(fit, legend="bottomleft", status=paste0( ifelse(rownames(fit) %in% rownames(top)[1:40], "topTable ",""), ifelse(rownames(fit) %in% confects$table$name[1:40], "confects ","")))
An analysis in edgeR produces similar results. Note that only quasi-likelihood testing from edgeR is supported.
y <- estimateDisp(y, design, robust=TRUE) efit <- glmQLFit(y, design, robust=TRUE)
A step of 0.05 is used here merely so that the vignette will build quickly.
edger_confects
calls edgeR::glmTreat
repeatedly, which is necessarily slow.
In practice a smaller value such as 0.01 should be used.
econfects <- edger_confects(efit, coef="Treathrcc", fdr=0.05, step=0.05) econfects
confects_plot(econfects) confects_plot_me(econfects)
etop <- glmQLFTest(efit, coef="Treathrcc") %>% topTags(n=Inf) plotMD(efit, legend="bottomleft", status=paste0( ifelse(rownames(efit) %in% econfects$table$name[1:40], "confects ", ""), ifelse(rownames(efit) %in% rownames(etop)[1:40], "topTags ","")))
DESeq2 does its own filtering of lowly expressed genes, so we start from the original count matrix. The initial steps are as for a normal DESeq2 analysis.
library(DESeq2) dds <- DESeqDataSetFromMatrix( countData = arab, colData = data.frame(Time, Treat), rowData = arab_info, design = ~Time+Treat) dds <- DESeq(dds)
The contrast or coefficient to test is specified as in the DESeq2::results
function. The step of 0.05 is merely so that this vignette will build quickly,
in practice a smaller value such as 0.01 should be used. deseq2_confects
calls results
repeatedly, and in fairness results
has not been optimized for this.
dconfects <- deseq2_confects(dds, name="Treat_hrcc_vs_mock", step=0.05)
DESeq2 offers shrunken estimates of LFC. This is another sensible way of ranking genes. Let's compare them to the confect values.
shrunk <- lfcShrink(dds, coef="Treat_hrcc_vs_mock", type="ashr") dconfects$table$shrunk <- shrunk$log2FoldChange[dconfects$table$index] dconfects
DESeq2 filters some genes, these are placed last in the table. If your intention
is to obtain a ranking of all genes, you should disable this with
deseq2_confects(..., cooksCutoff=Inf, independentFiltering=FALSE)
.
table(dconfects$table$filtered) tail(dconfects$table)
Shrunk LFC estimates are shown in red.
confects_plot(dconfects) + geom_point(aes(x=shrunk, size=baseMean, color="lfcShrink"), alpha=0.75)
lfcShrink
aims for a best estimate of the LFC, whereas confect is a
conservative estimate. lfcShrink
can produce non-zero values for genes which
can't be said to significantly differ from zero -- it doesn't do double duty as
an indication of significance -- whereas the confect value will be NA
in this
case. The plot below compares these two quantities. Only un-filtered genes are
shown (see above).
filter(dconfects$table, !filtered) %>% ggplot(aes( x=ifelse(is.na(confect),0,confect), y=shrunk, color=!is.na(confect))) + geom_point() + geom_abline() + coord_fixed() + theme_bw() + labs(color="Significantly\nnon-zero at\nFDR 0.05", x="confect", y="lfcShrink using ashr")
rank_rank_plot(confects$table$name, econfects$table$name, "limma confects", "edgeR confects") rank_rank_plot(confects$table$name, dconfects$table$name, "limma confects", "DESeq2 confects") rank_rank_plot(econfects$table$name, dconfects$table$name, "edgeR confects", "DESeq2 confects")
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.