ggplot2::theme_set(AcidPlots::acid_theme_light()) knitr::opts_chunk[["set"]](message = FALSE)
For a high-level overview of our bcbio RNA-seq analysis pipeline, including detailed explanation of the bcbioRNASeq
S4 class definition, first consult our workflow paper published in F1000 Research [@Steinbaugh2018-rc]. This vignette is focused on more advanced usage and edge cases that a user may encounter when attempting to load a bcbio dataset and perform downstream quality control analysis.
Note: if you use bcbioRNASeq in published research, please include this citation:
citation("bcbioRNASeq")
## nolint start suppressPackageStartupMessages({ library(basejump) library(bcbioRNASeq) }) ## nolint end
The bcbioRNASeq
constructor function is the main interface connecting bcbio] output data to interactive use in R. It is highly customizable and supports a number of options for advanced use cases. Consult the documentation for additional details.
help(topic = "bcbioRNASeq", package = "bcbioRNASeq")
We have designed the constructor to work as simply as possible by default. The only required argument is uploadDir
, the path to the bcbio final upload directory specified with upload:
in the YAML configuration. Refer to the bcbio configuration documentation for detailed information on how to set up a bcbio run, which is outside the scope of this vignette.
For example, let's load up the example bcbio dataset stored internally in the package.
uploadDir <- system.file("extdata", "bcbio", package = "bcbioRNASeq")
bcbio outputs RNA-seq data in a standardized directory structure, which is described in detail in our workflow paper.
sort(list.files(path = uploadDir, full.names = FALSE, recursive = TRUE))
By default, bcbioRNASeq imports counts at gene level, which are required for standard differential expression analysis (level = "genes"
). For pseudo-aligned counts (e.g. Salmon, Kallisto, Sailfish) [@Bray2016-me; @Patro2014-qz; @Patro2017-pi], tximport [@Soneson2016-tu] is used internally to aggregate transcript-level counts to gene-level counts, and generates length-scaled transcripts per million (TPM) values. For aligned counts processed with featureCounts [@Liao2014-js] (e.g. STAR, HISAT2) [@Dobin2013-hr; @Dobin2016-qj; @Kim2015-rr], these values are already returned at gene level, and therefore not handled by tximport. Once the gene-level counts are imported during the bcbioRNASeq
call, the DESeq2 package [@Love2014-sq] is then used to generate an internal DESeqDataSet
from which we derive normalized and variance-stabilized counts.
object <- bcbioRNASeq(uploadDir = uploadDir, level = "genes") print(object)
Alternatively, if you want to perform transcript-aware analysis, such as differential exon usage or splicing analysis, transcript-level counts can be obtained using level = "transcripts"
. Note that when counts are loaded at transcript level, TPMs are generated with tximport internally, but no additional normalizations or transformations normally calculated for gene-level counts with DESeq2 are generated.
object <- bcbioRNASeq(uploadDir = uploadDir, level = "transcripts", fast = TRUE) print(object)
Since bcbio is flexible and supports a number of expression callers, we have provided advanced options in the bcbioRNASeq
constructor to support a variety of workflows using the caller
argument. Salmon, Kallisto, and Sailfish counts are supported at either gene or transcript level. Internally, these are loaded using tximport. STAR and HISAT2 aligned counts processed with featureCounts are also supported, but only at gene level.
object <- bcbioRNASeq(uploadDir = uploadDir, caller = "star", fast = TRUE) print(object)
If you'd like to load up only a subset of samples, this can be done easily using the samples
argument. Note that the character
vector declared here must match the description
column specified in the sample metadata. Conversely, if you're working with a large dataset and you simply want to drop a few samples, this can be accomplished with the censorSamples
argument.
When working with a bcbio run that has incorrect or outdated metadata, the simplest way to fix this issue is to pass in new metadata from an external spreadsheet (CSV or Excel) using the sampleMetadataFile
argument. Note that this can also be used to subset the bcbio dataset, similar to the samples
argument (see above), based on the rows that are included in the spreadsheet.
sampleMetadataFile <- file.path(uploadDir, "sample_metadata.csv") stopifnot(file.exists(sampleMetadataFile)) import(sampleMetadataFile)
object <- bcbioRNASeq( uploadDir = uploadDir, sampleMetadataFile = sampleMetadataFile, fast = TRUE ) sampleData(object)
When analyzing a dataset against a well-annotated genome, we recommend importing the corresponding metadata using AnnotationHub and ensembldb. This functionality is natively supported in the bcbioRNASeq
constructor with using the organism
, ensemblRelease
, and genomeBuild
arguments. For example, with our internal bcbio dataset, we're analyzing counts generated against the Ensembl Mus musculus GRCm38 genome build (release 87). These parameters can be defined in the object load call to ensure that the annotations match up exactly with the genome used.
object <- bcbioRNASeq( uploadDir = uploadDir, level = "genes", organism = "Mus musculus", genomeBuild = "GRCm38", ensemblRelease = 87L )
This will return a GRanges
object using the GenomicRanges package [@Lawrence2013-vz], which contains coordinates and rich metadata for each gene or transcript. These annotations are accessible with the rowRanges
and rowData
functions defined in the SummarizedExperiment package [@Huber2015-nw].
summary(rowRanges(object)) summary(rowData(object))
Alternatively, transcript-level annotations can also be obtained automatically using this method.
When working with a dataset generated against a poorly-annotated or non-standard genome, we provide a fallback method for loading gene annotations from a general feature format (GFF) file with the gffFile
argument. If possible, we recommend providing a general transfer format (GTF) file, which is identical to GFF version 2. GFFv3 is more complicated and non-standard, but Ensembl GFFv3 files are also supported.
If your dataset contains transgenes (e.g. EGFP, TDTOMATO), these features can be defined with the transgeneNames
argument, which will automatically populate the rowRanges
slot with placeholder metadata.
We recommend loading up data per genome in its own bcbioRNASeq
object when possible, so that rich metadata can be imported easily. In the edge case where you need to look at multiple genomes simultaneously, set organism = NULL
, and bcbioRNASeq will skip the gene annotation acquisition step.
Refer to the the GenomicRanges and SummarizedExperiment package documentation for more details on working with the genome annotations defined in the rowRanges
slot of the object. Here are some useful examples:
seqnames(object) ranges(object) strand(object)
During the bcbioRNASeq
constructor call, log2 variance stabilizaton of gene-level counts can be calculated automatically, and is recommended. This is performed internally by the DESeq2 package, using the varianceStabilizingTransformation
and/or rlog
functions. These transformations will be slotted into assays
.
summary(counts(object, normalized = "vst"))
To demonstrate the functionality and configuration of the package, we have taken an experiment from the Gene Expression Omnibus (GEO) public repository of expression data to use as an example use case. The RNA-seq data is from a study of acute kidney injury in a mouse model (GSE65267) [@Craciun2016-ha]. The study aims to identify differentially expressed genes in progressive kidney fibrosis and contains samples from mouse kidneys at several time points (n = 3, per time point) after folic acid treatment. From this dataset, we are using a subset of the samples for our use case: before folic acid treatment, and 1, 3, 7 days after treatment.
For the vignette, we are loading a pre-computed version of the example bcbioRNASeq
object used in our workflow paper.
object <- import( con = cacheUrl( pasteUrl( "github.com", "hbc", "bcbioRNASeq", "raw", "f1000v2", "data", "bcb.rda", protocol = "https" ), pkg = "bcbioRNASeq" ) ) object <- updateObject(object) print(object)
For reference, let's take a look at the sample metadata. By comparison, using colData
will return all sample-level metadata, including our quality control metrics generated by bcbio. We recommend instead using sampleData
with the clean = TRUE
argument in reports, which only returns factor
columns of interest.
sampleData(object)
Groups of interest to be used for coloring and sample grouping in the quality control plots can be defined in the bcbioRNASeq
object using the interestingGroups
argument in the bcbioRNASeq
constructor call. Assignment method support with the interestingGroups
function is also provided, which can modify the groups of interest after the object has been created.
The interestingGroups
definition defaults to sampleName
, which is automatically generated from the bcbio description
metadata for demultiplexed bulk RNA-seq samples. In this case, the samples will be grouped and colored uniquely.
Interesting groups must be defined using a character
vector and refer to column names defined in the colData
slot of the object. Note that the bcbioRNASeq package uses lower camel case formatting for column names (e.g. "sampleName"), so the interesting groups should be defined using camel case, not snake (e.g. "sample_name") or dotted case (e.g. "sample.name").
This approach was inspired by the DESeq2 package, which uses the argument intgroup
in some functions, such as plotPca
for labeling groups of interest. We took this idea and ran with it for a number of our quality control functions, which are described in detail below.
interestingGroups(object) <- c("treatment", "day")
Our workflow paper describes the quality control process in detail. In addition, our Quality Control R Markdown template contains notes detailing each metric. Here were are going into more technical detail regarding how to customize the appearance of the plots. Note that all quality control plotting functions inherit the interestingGroups
defined inside the bcbioRNASeq
object. You can also change this dynamially for each function call using the interestingGroups
argument.
plotTotalReads(object) plotMappedReads(object)
Note that the overall mapping rate is relatively low per sample, while the exonic mapping rate is acceptable. High quality samples should have low intronic mapping rate, and high values are indicative of sample degradation and/or contamination.
plotMappingRate(object) plotExonicMappingRate(object) plotIntronicMappingRate(object)
Note that the samples here have a high rRNA mapping rate. This can be indicative of the polyA enrichment or ribo depletion protocol not having removed all ribosomal RNA (rRNA) transcripts. This will reduce the number of biologically meaningful reads in the experiment and is best avoided.
plotRrnaMappingRate(object)
RNA-seq data can have specific biases at either the 5’ or 3’ end of sequenced fragments. It is common to see a small amount of bias, especially if polyA enrichment was performed, or if there is any sample degradation. If a large amount of bias is observed here, be sure to analyze the samples with a Bioanalyzer and check the RIN scores.
plot5Prime3PrimeBias(object)
plotFeaturesDetected(object) plotFeatureSaturation(object)
plotCountsPerFeature(object, geom = "boxplot") plotCountsPerFeature(object, geom = "density")
We can visualize sample similarity with principal component analysis (PCA) and hierarchical clustering of sample correlations. These functions support multiple normalization methods with the normalized
argument.
plotPca(object) plotCorrelationHeatmap(object)
The counts
accessor function returns raw counts by default. Multiple normalization methods are accessible with the normalized
argument.
rawCounts <- counts(object, normalized = FALSE) normalizedCounts <- counts(object, normalized = TRUE) tpm <- counts(object, normalized = "tpm") log2VstCounts <- counts(object, normalized = "vst")
We are providing bcbioRNASeq
method support for the export
function, which automatically exports matrices defined in assays
and corresponding metadata (e.g. rowData
, colData
) to disk in a single call.
files <- export(object, con = tempdir()) print(files)
bcbioRNASeq integrates with DESeq2 and edgeR for differential expression.
To prepare our dataset for analysis, we need to coerce the bcbioRNASeq
object to a DESeqDataSet
. We have defined an S4 coercion method that uses the as
function in the package.
dds <- as(object, "DESeqDataSet") print(dds)
Since both bcbioRNASeq
and DESeqDataSet
classes extend RangedSummarizedExperiment
, internally we coerce the original bcbioRNASeq
object to a RangedSummarizedExperiment
and then use the DESeqDataSet
constructor, which requires a SummarizedExperiment
with integer counts.
The source code for our bcbioRNASeq
to DESeqDataSet
coercion is accessible with getMethod
.
getMethod( f = "coerce", signature = signature(from = "bcbioRNASeq", to = "DESeqDataSet") )
We also provide a parameterized starter R Markdown template for standard DESeq2 differential expression, supporting analysis and visualization of multiple contrasts inside a single report. In our workflow paper, we also describe an example LRT analysis.
Now you can perform differential expression analysis with DESeq2. The authors of that package have provided a number of detailed, well documented references online:
To prepare our dataset for analysis, we need to coerce the bcbioRNASeq
object to a DGEList
. Similar to our DESeq2 approach, we have defined an S4 coercion method that hands off to edgeR.
dge <- as(object, "DGEList") summary(dge)
The source code for our bcbioRNASeq
to DGEList
coercion is accessible with getMethod
.
getMethod( f = "coerce", signature = signature(from = "bcbioRNASeq", to = "DGEList") )
We provide a starter R Markdown template for gene set enrichment analysis (GSEA) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis [@Kanehisa2000-hk; @Subramanian2005-zk], which leverages the functionality of the clusterProfiler package [@Yu2012-ae]. This workflow is described in detail in our workflow paper, and is outside the scope of this advanced use vignette. For more information on functional analysis, consult the clusterProfiler vignette or Stephen Turner's excellent DESeq2 to fgsea workflow.
utils::sessionInfo()
The papers and software cited in our workflows are available as a shared library on Paperpile.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.