r software
results explorationProject: r project
.
This report is meant to help explore r software
r if(software == 'edgeR') citep(bib[[c('edgeR1', 'edgeR2', 'edgeR6')]]) else if (software == 'DESeq2') citep(bib[['DESeq2']]) else citep(bib[['other']])
results and was generated using the regionReport
r citep(bib[['regionReport']])
package. While the report is rich, it is meant to just start the exploration of the results and exemplify some of the code used to do so. If you need a more in-depth analysis for your specific data set you might want to use the customCode
argument. This report is based on the vignette of the DESeq2
r citep(bib[['DESeq2']])
package which you can find here.
This section contains the code for setting up the rest of the report.
## knitrBoostrap and device chunk options library('knitr') opts_chunk$set(bootstrap.show.code = FALSE, dev = device) opts_chunk$set(comment=NA, fig.width=10, fig.height=10) if(!outputIsHTML) opts_chunk$set(bootstrap.show.code = FALSE, dev = device, echo = FALSE)
#### Libraries needed ## Bioconductor library('DESeq2') if(isEdgeR) library('edgeR') ## CRAN library('ggplot2') if(!is.null(theme)) theme_set(theme) library('knitr') if(is.null(colors)) { library('RColorBrewer') } library('pheatmap') library('EnhancedVolcano') library('DT') library('sessioninfo') #### Code setup ## For ggplot res.df <- as.data.frame(res) ## Sort results by adjusted p-values ord <- order(res.df$padj, decreasing = FALSE) res.df <- res.df[ord, ] features <- rownames(res.df) res.df <- cbind(data.frame(Feature = features), res.df) rownames(res.df) <- NULL
## Transform count data rld <- tryCatch(rlog(dds), error = function(e) { rlog(dds, fitType = 'mean') }) ## Perform PCA analysis and make plot plotPCA(rld, intgroup = intgroup) ## Get percent of variance explained data_pca <- plotPCA(rld, intgroup = intgroup, returnData = TRUE) percentVar <- round(100 * attr(data_pca, "percentVar"))
The above plot shows the first two principal components that explain the variability in the data using the regularized log count data. If you are unfamiliar with principal component analysis, you might want to check the Wikipedia entry or this interactive explanation. In this case, the first and second principal component explain r percentVar[1]
and r percentVar[2]
percent of the variance respectively.
EnhancedVolcano(res, lab = rownames(res),x = 'log2FoldChange',y = 'pvalue', xlim = c(-5, 8))
EnhancedVolcano(res, lab = rownames(res), x = 'log2FoldChange', y = 'pvalue', xlim = c(-8, 8), title = 'FC Cutoff 1.5, P-value cutoff 10e-16', pCutoff = 10e-16, FCcutoff = 1.5, transcriptPointSize = 1.5, transcriptLabSize = 3.0, col=c('black', 'black', 'black', 'red3'), colAlpha = 1)
## Obtain the sample euclidean distances sampleDists <- dist(t(assay(rld))) sampleDistMatrix <- as.matrix(sampleDists) ## Add names based on intgroup rownames(sampleDistMatrix) <- apply(as.data.frame(colData(rld)[, intgroup]), 1, paste, collapse = ' : ') colnames(sampleDistMatrix) <- NULL ## Define colors to use for the heatmap if none were supplied if(is.null(colors)) { colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255) } ## Make the heatmap pheatmap(sampleDistMatrix, clustering_distance_rows = sampleDists, clustering_distance_cols = sampleDists, color = colors)
This plot shows how samples are clustered based on their euclidean distance using the regularized log transformed count data. This figure gives an overview of how the samples are hierarchically clustered. It is a complementary figure to the PCA plot.
This section contains three MA plots (see Wikipedia) that compare the mean of the normalized counts against the log fold change. They show one point per feature. The points are shown in red if the feature has an adjusted p-value less than alpha
, that is, the statistically significant features are shown in red.
## MA plot with alpha used in DESeq2::results() plotMA(res, alpha = metadata(res)$alpha, main = paste('MA plot with alpha =', metadata(res)$alpha))
This first plot shows uses alpha
= r metadata(res)$alpha
, which is the alpha
value used to determine which resulting features were significant when running the function DESeq2::results()
.
## MA plot with alpha = 1/2 of the alpha used in DESeq2::results() plotMA(res, alpha = metadata(res)$alpha / 2, main = paste('MA plot with alpha =', metadata(res)$alpha / 2))
This second MA plot uses alpha
= r metadata(res)$alpha / 2
and can be used agains the first MA plot to identify which features have adjusted p-values between r metadata(res)$alpha / 2
and r metadata(res)$alpha
.
## MA plot with alpha corresponding to the one that gives the nBest features nBest.actual <- min(nBest, nrow(head(res.df, n = nBest))) nBest.alpha <- head(res.df, n = nBest)$padj[nBest.actual] plotMA(res, alpha = nBest.alpha * 1.00000000000001, main = paste('MA plot for top', nBest.actual, 'features'))
The third and final MA plot uses an alpha such that the top r nBest.actual
features are shown in the plot. These are the features that whose details are included in the top features interactive table.
## P-value histogram plot ggplot(res.df[!is.na(res.df$pvalue), ], aes(x = pvalue)) + geom_histogram(alpha=.5, position='identity', bins = 50) + labs(title='Histogram of unadjusted p-values') + xlab('Unadjusted p-values') + xlim(c(0, 1.0005))
This plot shows a histogram of the unadjusted p-values. It might be skewed right or left, or flat as shown in the Wikipedia examples. The shape depends on the percent of features that are differentially expressed. For further information on how to interpret a histogram of p-values check David Robinson's post on this topic.
## P-value distribution summary summary(res.df$pvalue)
This is the numerical summary of the distribution of the p-values.
## Split features by different p-value cutoffs pval_table <- lapply(c(1e-04, 0.001, 0.01, 0.025, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1), function(x) { data.frame('Cut' = x, 'Count' = sum(res.df$pvalue <= x, na.rm = TRUE)) }) pval_table <- do.call(rbind, pval_table) if(outputIsHTML) { kable(pval_table, format = 'markdown', align = c('c', 'c')) } else { kable(pval_table) }
This table shows the number of features with p-values less or equal than some commonly used cutoff values.
## Adjusted p-values histogram plot ggplot(res.df[!is.na(res.df$padj), ], aes(x = padj)) + geom_histogram(alpha=.5, position='identity', bins = 50) + labs(title=paste('Histogram of', elementMetadata(res)$description[grep('adjusted', elementMetadata(res)$description)])) + xlab('Adjusted p-values') + xlim(c(0, 1.0005))
This plot shows a histogram of the r elementMetadata(res)$description[grep('adjusted', elementMetadata(res)$description)]
. It might be skewed right or left, or flat as shown in the Wikipedia examples.
## Adjusted p-values distribution summary summary(res.df$padj)
This is the numerical summary of the distribution of the r elementMetadata(res)$description[grep('adjusted', elementMetadata(res)$description)]
.
## Split features by different adjusted p-value cutoffs padj_table <- lapply(c(1e-04, 0.001, 0.01, 0.025, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1), function(x) { data.frame('Cut' = x, 'Count' = sum(res.df$padj <= x, na.rm = TRUE)) }) padj_table <- do.call(rbind, padj_table) if(outputIsHTML) { kable(padj_table, format = 'markdown', align = c('c', 'c')) } else { kable(padj_table) }
This table shows the number of features with r elementMetadata(res)$description[grep('adjusted', elementMetadata(res)$description)]
less or equal than some commonly used cutoff values.
This r ifelse(outputIsHTML, 'interactive', '')
table shows the top r nBest.actual
features ordered by their r elementMetadata(res)$description[grep('adjusted', elementMetadata(res)$description)]
. r ifelse(outputIsHTML, 'Use the search function to find your feature of interest or sort by one of the columns.', 'Since the report is in PDF format, only the top 20 features are shown.')
## Add search url if appropriate if(!is.null(searchURL) & outputIsHTML) { res.df$Feature <- paste0('<a href="', searchURL, res.df$Feature, '">', res.df$Feature, '</a>') } for(i in which(colnames(res.df) %in% c('pvalue', 'padj'))) res.df[, i] <- format(res.df[, i], scientific = TRUE) if(outputIsHTML) { datatable(head(res.df, n = nBest), options = list(pagingType='full_numbers', pageLength=10, scrollX='100%'), escape = FALSE, rownames = FALSE) %>% formatRound(which(!colnames(res.df) %in% c('pvalue', 'padj', 'Feature')), digits) } else { res.df_top <- head(res.df, n = 20) for(i in which(!colnames(res.df) %in% c('pvalue', 'padj', 'Feature'))) res.df_top[, i] <- round(res.df_top[, i], digits) kable(res.df_top) }
This section contains plots showing the normalized counts per sample for each group of interest. Only the best r nBestFeatures
features are shown, ranked by their r elementMetadata(res)$description[grep('adjusted', elementMetadata(res)$description)]
. The Y axis is on the log10 scale and the feature name is shown in the title of each plot.
plotCounts_gg <- function(i, dds, intgroup) { group <- if (length(intgroup) == 1) { colData(dds)[[intgroup]] } else if (length(intgroup) == 2) { lvls <- as.vector(t(outer(levels(colData(dds)[[intgroup[1]]]), levels(colData(dds)[[intgroup[2]]]), function(x, y) paste(x, y, sep = " : ")))) droplevels(factor(apply(as.data.frame(colData(dds)[, intgroup, drop = FALSE]), 1, paste, collapse = " : "), levels = lvls)) } else { factor(apply(as.data.frame(colData(dds)[, intgroup, drop = FALSE]), 1, paste, collapse = " : ")) } data <- plotCounts(dds, gene=i, intgroup=intgroup, returnData = TRUE) ## Change in version 1.15.3 ## It might not be necessary to have any of this if else, but I'm not ## sure that plotCounts(returnData) will always return the 'group' variable. if('group' %in% colnames(data)) { data$group <- group } else { data <- cbind(data, data.frame('group' = group)) } ggplot(data, aes(x = group, y = count)) + geom_point() + ylab('Normalized count') + ggtitle(i) + coord_trans(y = "log10") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) } for(i in head(features, nBestFeatures)) { print(plotCounts_gg(i, dds = dds, intgroup = intgroup)) }
r ifelse(isEdgeR, '# edgeR specific plots', '')
r ifelse(isEdgeR, '## Biological coefficient of variation', '')
## Make BCV plot for edgeR results plotBCV(dge)
r ifelse(isEdgeR, 'This plot shows the feature-wise biological coefficient of variation (BCV) against the feature abundances in log2 counts per million. The plot shows the common, trended and feature-wise BCV estimates. If using _edgeR-robust_ only the trend and tagwise are shown. Check the [edgeR vignette](http://www.bioconductor.org/packages/edgeR) for further details regarding this plot.', '')
r ifelse(isEdgeR, '## MDS plot of distances', '')
## Make MDS plot for edgeR results plotMDS(dge)
r ifelse(isEdgeR, 'This plot is a multidimensional scaling plot of distances between feature expression profiles. It shows the names of the samples in a two-dimensional scatterplot such that the distances are approximately the log2 fold changes between samples. Check the [edgeR vignette](http://www.bioconductor.org/packages/edgeR) for further details regarding this plot.', '')
The input for this report was generated with r software
r if(software == 'edgeR') citep(bib[[c('edgeR1', 'edgeR2', 'edgeR6')]]) else if (software == 'DESeq2') citep(bib[['DESeq2']]) else citep(bib[['other']])
r if(software == 'DESeq2') paste('using version', metadata(dds)$version)
and the resulting features were called significantly differentially expressed if their r elementMetadata(res)$description[grep('adjusted', elementMetadata(res)$description)]
were less than alpha
= r metadata(res)$alpha
. This report was generated in path r tmpdir
using the following call to r ifelse(software != 'edgeR', 'DESeq2Report()', 'edgeReport()')
:
theCall
Date the report was generated.
## Date the report was generated Sys.time()
Wallclock time spent generating the report.
## Processing time in seconds totalTime <- diff(c(startTime, Sys.time())) round(totalTime, digits=3)
R
session information.
## Session info options(width = 120) session_info()
Pandoc version used: r rmarkdown::pandoc_version()
.
This report was created with regionReport
r citep(bib[['regionReport']])
using rmarkdown
r citep(bib[['rmarkdown']])
while knitr
r citep(bib[['knitr']])
and DT
r citep(bib[['DT']])
were running behind the scenes. pheatmap
r citep(bib[['pheatmap']])
was used to create the sample distances heatmap. Several plots were made with ggplot2
r citep(bib[['ggplot2']])
.
Citations made with knitcitations
r citep(bib[['knitcitations']])
. The BibTeX file can be found here.
## Print bibliography bibliography()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.