#' @inherit bcbioRNASeq-class title description
#' @author Michael Steinbaugh, Lorena Pantano, Rory Kirchner, Victor Barrera
#' @note Updated 2022-05-09.
#' @export
#' @details
#' Automatically imports RNA-seq counts, metadata, and the program versions used
#' from a [bcbio][] RNA-seq run. Simply point to the final upload directory
#' generated by bcbio, and this generator function will take care of the rest.
#' [bcbio]: https://bcbio-nextgen.readthedocs.io/en/latest/
#' @section Sample metadata:
#' When loading a bcbio RNA-seq run, the sample metadata will be imported
#' automatically from the `project-summary.yaml` file in the final upload
#' directory. If you notice any typos in your metadata after completing the run,
#' these can be corrected by editing the YAML file.
#' Alternatively, you can pass a sample metadata file to `bcbioRNASeq()` using
#' the `sampleMetadataFile` argument. This must be either a CSV file or an
#' Excel spreadsheet.
#' The samples in the bcbio run must map to the `description` column. The values
#' provided in `description` must be unique. These values will be sanitized into
#' syntactically valid names (see `make.names()` for more information), and
#' assigned as the column names of the `bcbioRNASeq` object. The original values
#' are stored as the `sampleName` column in `colData`, and are used for all
#' plotting functions. Do not attempt to set a `sampleId` column, as this is
#' used internally by the package.
#' Here is a minimal example of a properly formatted sample metadata file:
#' \tabular{ll}{
#' description \tab genotype\cr
#' sample1 \tab wildtype\cr
#' sample2 \tab knockout\cr
#' sample3 \tab wildtype\cr
#' sample4 \tab knockout
#' }
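#'
#' As a rough illustration of the rules above, the following base R sketch
#' writes the minimal table to a temporary CSV file and checks that the
#' `description` values are unique and that no `sampleId` column is present.
#' The `meta`, `file`, and `check` names are placeholders chosen for this
#' example only. Passing the file to `bcbioRNASeq()` is left out because that
#' step needs a real bcbio upload directory.
#'
#' ```r
#' ## Hypothetical sample metadata matching the minimal table above.
#' meta <- data.frame(
#'     description = c("sample1", "sample2", "sample3", "sample4"),
#'     genotype = c("wildtype", "knockout", "wildtype", "knockout")
#' )
#' ## Write to a temporary CSV file (placeholder path).
#' file <- tempfile(fileext = ".csv")
#' write.csv(meta, file = file, row.names = FALSE)
#' ## Read back and check the constraints described in this section.
#' check <- read.csv(file)
#' stopifnot(
#'     "description" %in% colnames(check),
#'     anyDuplicated(check[["description"]]) == 0L,
#'     !("sampleId" %in% colnames(check))
#' )
#' ```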
#' @section Valid names:
#' R is strict about values that are considered valid for use in `names()` and
#' `dimnames()` (i.e. `rownames()` and `colnames()`). Spaces, **dashes**, and
#' most other non-alphanumeric characters are not valid. Use either underscores
#' or periods in place of dashes when working in R. Also note that names should
#' **not begin with a number**, and will be prefixed with an `X` when sanitized.
#' Consult the `make.names()` documentation for more information. We strongly
#' recommend adhering to these conventions when
#' labeling samples, to help avoid unexpected downstream behavior in R due to
#' `dimnames()` mismatches.
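#'
#' A quick way to preview how names are likely to be sanitized is to call
#' `make.names()` directly. The input names below are invented for
#' illustration, and the package may apply additional sanitization beyond what
#' this base R call shows.
#'
#' ```r
#' ## Hypothetical sample names, three of which are not valid in R.
#' original <- c("sample-1", "sample 2", "1sample", "sample_4")
#' sanitized <- make.names(original, unique = TRUE)
#' ## Compare the original and sanitized values side by side.
#' print(data.frame(original = original, sanitized = sanitized))
#' ## Dashes and spaces are replaced by periods, and the name starting with a
#' ## digit is prefixed with "X". The underscore is left unchanged.
#' ```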
#' @section Genome annotations:
#' `bcbioRNASeq()` provides support for automatic import of genome annotations,
#' which internally get processed into genomic ranges (`GRanges`) and are
#' slotted into the `rowRanges()` of the object. Currently, we offer support for
#' (1) [Ensembl][] genome annotations from [AnnotationHub][] via [ensembldb][]
#' (*recommended*); or (2) direct import from a GTF/GFF file using
#' [rtracklayer][].
#' [ensembldb][] requires the `organism` and `ensemblRelease` arguments to be
#' defined. When both of these are set, `bcbioRNASeq` will attempt to
#' download and use the pre-built [Ensembl][] genome annotations from
#' [AnnotationHub][]. This method is preferred over direct loading of a GTF/GFF
#' file because the [AnnotationHub][] annotations contain additional rich
#' metadata not defined in a GFF file, specifically `description` and `entrezId`
#' values.
#' Alternatively, if you are working with a non-standard or poorly annotated
#' genome that isn't available on [AnnotationHub][], we provide fallback
#' support for loading the genome annotations directly from the GTF file used by
#' the bcbio RNA-seq pipeline. This should be fully automatic for an R session
#' active on the same server used to run [bcbio][].
#' Example bcbio GTF path: `genomes/Hsapiens/hg38/rnaseq/ref-transcripts.gtf`.
#' In the event that you are working from a remote environment that doesn't
#' have file system access to the [bcbio][] `genomes` directory, we provide
#' additional fallback support for importing genome annotations directly from
#' a GTF/GFF file with the `gffFile` argument.
#' Internally, genome annotations are imported via the [AcidGenomes][] package,
#' specifically with either of these functions:
#' - `AcidGenomes::makeGRangesFromEnsembl()`.
#' - `AcidGenomes::makeGRangesFromGff()`.
#' [acidgenomes]: https://r.acidgenomics.com/packages/acidgenomes/
#' [annotationhub]: https://bioconductor.org/packages/AnnotationHub/
#' [bcbio]: https://bcbio-nextgen.readthedocs.io/en/latest/
#' [ensembl]: https://useast.ensembl.org/
#' [ensembldb]: https://bioconductor.org/packages/ensembldb/
#' [rtracklayer]: https://bioconductor.org/packages/rtracklayer/
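#'
#' The choice between these sources can be summarized with a small base R
#' sketch. The helper `annotationSource()` below is hypothetical and is not
#' part of the package. It only mirrors the priority described in this section:
#' Ensembl via AnnotationHub when `organism` and `ensemblRelease` are both set,
#' otherwise a GTF/GFF file when one is given. It deliberately ignores the
#' automatic lookup of the bcbio GTF path.
#'
#' ```r
#' ## Hypothetical helper, for illustration only.
#' annotationSource <- function(organism = NULL,
#'                              ensemblRelease = NULL,
#'                              gffFile = NULL) {
#'     if (is.character(organism) && is.numeric(ensemblRelease)) {
#'         "ensembl"
#'     } else if (is.character(gffFile)) {
#'         "gff"
#'     } else {
#'         ## No annotation source can be determined from these arguments.
#'         "none"
#'     }
#' }
#' annotationSource(organism = "Mus musculus", ensemblRelease = 87L)
#' annotationSource(gffFile = "ref-transcripts.gtf")
#' ```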
#' @section Genome build:
#' Ensure that the organism and genome build used with bcbio match correctly
#' here in the function call. In particular, for the legacy *Homo sapiens*
#' GRCh37/hg19 genome build, ensure that `genomeBuild = "GRCh37"`. Otherwise,
#' the genomic ranges set in `rowRanges()` will mismatch. For current projects,
#' we recommend using GRCh38/hg38 in place of GRCh37/hg19 if possible.
#' @section DESeq2:
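#' DESeq2 is run automatically at gene level when `bcbioRNASeq()` is called,
#' unless `fast = TRUE` is set. Fast mode is also enabled automatically for
#' minimal datasets containing fewer than 4 samples. Internally, the DESeq2
#' step slots normalized counts into `assays()`, and generates
#' variance-stabilized counts.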
#' @section Remote connections:
#' When working on a local machine, it is possible to load bcbio run data over a
#' remote connection using [sshfs][]. When loading a large number of samples, it
#' is preferable to call `bcbioRNASeq()` directly in R on the remote server, if
#' possible.
#' [sshfs]: https://github.com/osxfuse/osxfuse/wiki/SSHFS
#' @inheritParams AcidExperiment::makeSummarizedExperiment
#' @inheritParams AcidRoxygen::params
#' @param level `character(1)`.
#' Import counts at gene level ("`genes`"; *default*) or transcript level
#' ("`transcripts`"; *advanced use*). Only tximport-compatible callers (e.g.
#' salmon, kallisto, sailfish) can be loaded at transcript level. Aligned
#' counts from featureCounts-compatible callers (e.g. STAR, HISAT2) can only
#' be loaded at gene level.
#' @param caller `character(1)`.
#' Expression caller:
#' - `"salmon"` (*default*): [Salmon][] alignment-free, quasi-mapped counts.
#' - `"kallisto"`: [Kallisto][] alignment-free, pseudo-aligned counts.
#' - `"sailfish"`: [Sailfish][] alignment-free, lightweight counts.
#' - `"star"`: [STAR][] (Spliced Transcripts Alignment to a Reference)
#' aligned counts.
#' - `"hisat2"`: [HISAT2][] (Hierarchical Indexing for Spliced Alignment of
#' Transcripts) graph-based aligned counts.
#' [HISAT2]: https://daehwankimlab.github.io/hisat2/
#' [Kallisto]: https://pachterlab.github.io/kallisto/
#' [Sailfish]: https://www.cs.cmu.edu/~ckingsf/software/sailfish/
#' [Salmon]: https://combine-lab.github.io/salmon/
#' [STAR]: https://github.com/alexdobin/STAR/
#' @param countsFromAbundance `character(1)`.
#' Whether to generate estimated counts using abundance estimates
#' (*recommended by default*). `lengthScaledTPM` is a suitable default, and
#' counts are scaled using the average transcript length over samples and then
#' the library size. Refer to `tximport::tximport()` for more information on
#' this parameter, but it should only ever be changed when loading some
#' datasets at transcript level (e.g. for DTU analysis).
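#'
#' As a sketch of what `lengthScaledTPM` scaling does, the base R code below
#' works through a toy example: abundances are multiplied by the average
#' transcript length over samples, and each sample is then rescaled to its
#' original library size. All matrices are invented, and this is a simplified
#' illustration of the idea rather than the tximport implementation.
#'
#' ```r
#' ## Toy data: 3 transcripts (rows) by 2 samples (columns).
#' abundance <- matrix(c(10, 20, 70, 30, 30, 40), ncol = 2L)
#' len <- matrix(c(1000, 2000, 500, 1200, 1800, 500), ncol = 2L)
#' counts <- matrix(c(100, 400, 350, 360, 540, 200), ncol = 2L)
#' ## Scale abundance by the average transcript length over samples.
#' scaled <- abundance * rowMeans(len)
#' ## Rescale each sample (column) to its original library size.
#' libSize <- colSums(counts)
#' lengthScaled <- sweep(scaled, 2L, libSize / colSums(scaled), FUN = "*")
#' ## Column sums now match the original library sizes.
#' colSums(lengthScaled)
#' ```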
#' @param fast `logical(1)`.
#' Fast mode.
#' Skip internal DESeq2 calculations and transformations.
#' Don't enable this setting when using the quality control R Markdown
#' template.
#' Note that some plotting functions, such as `plotPca()`, will not work when
#' this mode is enabled.
#' @return `bcbioRNASeq`.
#' @seealso
#' - `.S4methods(class = "bcbioRNASeq")`.
#' - `SummarizedExperiment::SummarizedExperiment()`.
#' - `methods::initialize()`.
#' - `methods::validObject()`.
#' - `BiocGenerics::updateObject()`.
#' @examples
#' uploadDir <- system.file("extdata/bcbio", package = "bcbioRNASeq")
#' ## Gene level.
#' object <- bcbioRNASeq(
#' uploadDir = uploadDir,
#' level = "genes",
#' caller = "salmon",
#' organism = "Mus musculus",
#' ensemblRelease = 87L
#' )
#' print(object)
#' ## Transcript level.
#' object <- bcbioRNASeq(
#' uploadDir = uploadDir,
#' level = "transcripts",
#' caller = "salmon",
#' organism = "Mus musculus",
#' ensemblRelease = 87L
#' )
#' print(object)
#' ## Fast mode.
#' object <- bcbioRNASeq(uploadDir = uploadDir, fast = TRUE)
bcbioRNASeq <-
function(uploadDir,
level = c("genes", "transcripts"),
caller = c("salmon", "kallisto", "sailfish", "star", "hisat2"),
samples = NULL,
censorSamples = NULL,
sampleMetadataFile = NULL,
organism = NULL,
genomeBuild = NULL,
ensemblRelease = NULL,
gffFile = NULL,
transgeneNames = NULL,
countsFromAbundance = "lengthScaledTPM",
interestingGroups = "sampleName",
fast = FALSE) {
level <- match.arg(level)
caller <- match.arg(caller)
if (identical(level, "transcripts")) {
assert(isSubset(caller, .tximportCallers))
isAny(samples, classes = c("character", "NULL")),
isAny(censorSamples, classes = c("character", "NULL")),
isString(sampleMetadataFile, nullOk = TRUE),
isString(organism, nullOk = TRUE),
isString(genomeBuild, nullOk = TRUE),
isInt(ensemblRelease, nullOk = TRUE),
isAny(transgeneNames, classes = c("character", "NULL")),
isString(gffFile, nullOk = TRUE),
if (isString(gffFile)) {
assert(isAFile(gffFile) || isAUrl(gffFile))
## Don't allow AnnotationHub formals when specifying GFF file.
if (!is.null(gffFile)) {
## Organism is required when we're defining the genome.
if (
!is.null(genomeBuild) ||
!is.null(ensemblRelease) ||
) {
arg = countsFromAbundance,
choices = eval(formals(tximport)[["countsFromAbundance"]])
alertInfo("Importing bcbio-nextgen RNA-seq run.")
## Run info ------------------------------------------------------------
h2("Run info")
uploadDir <- realpath(uploadDir)
dl(c("uploadDir" = uploadDir))
projectDir <- projectDir(uploadDir)
sampleDirs <- sampleDirs(uploadDir)
yamlFile <- file.path(projectDir, "project-summary.yaml")
yaml <- import(yamlFile)
dataVersions <-
importDataVersions(file.path(projectDir, "data_versions.csv"))
assert(is(dataVersions, "DFrame"))
programVersions <-
importProgramVersions(file.path(projectDir, "programs.txt"))
assert(is(programVersions, "DFrame"))
log <- import(file.path(projectDir, "bcbio-nextgen.log"))
## This step enables our minimal dataset to pass checks.
expr = assert(isCharacter(log)),
error = function(e) {
"{.file %s} file is empty.",
fastPipeline <- .isFastPipeline(log)
if (isTRUE(fastPipeline)) {
alertInfo("Fast RNA-seq pipeline detected.")
commandsLog <-
import(file.path(projectDir, "bcbio-nextgen-commands.log"))
## This step enables our minimal dataset to pass checks.
expr = assert(isCharacter(commandsLog)),
error = function(e) {
"{.file %s} file is empty.",
lanes <- detectLanes(sampleDirs)
assert(isInt(lanes) || identical(lanes, integer()))
## Column data (samples) -----------------------------------------------
h2("Sample metadata")
## Get the sample data.
if (isString(sampleMetadataFile)) {
## Normalize path of local file.
if (file.exists(sampleMetadataFile)) {
sampleMetadataFile <- realpath(sampleMetadataFile)
## User-defined metadata file.
sampleData <- importSampleData(
file = sampleMetadataFile,
lanes = lanes,
pipeline = "bcbio"
} else {
## Automatic metadata from YAML file.
sampleData <- getSampleDataFromYaml(yaml)
assert(isSubset(rownames(sampleData), names(sampleDirs)))
## Subset the sample directories, if necessary.
if (is.character(samples) || is.character(censorSamples)) {
## Matching against the YAML "description" input here.
description <- as.character(sampleData[["description"]])
if (is.character(samples)) {
assert(isSubset(samples, description))
} else {
samples <- description
if (is.character(censorSamples)) {
assert(isSubset(censorSamples, samples))
samples <- setdiff(samples, censorSamples)
keep <- sampleData[["description"]] %in% samples
sampleData <- sampleData[keep, , drop = FALSE]
samples <- rownames(sampleData)
isSubset(samples, names(sampleDirs)),
if (length(samples) < length(sampleDirs)) {
sampleDirs <- sampleDirs[samples]
txt("Loading a subset of samples:")
allSamples <- FALSE
} else {
allSamples <- TRUE
## Ensure fast mode is enabled for minimal datasets where DESeq2
## calculations are not appropriate.
if (length(samples) < 4L && isFALSE(fast)) {
n <- length(samples)
"Minimal dataset containing %d %s detected.",
n = n,
msg1 = "sample",
msg2 = "samples"
alert("Enabling fast mode, which skips DESeq2 calculations.")
fast <- TRUE
## Sample metrics. Note that sample metrics used for QC plots are not
## currently generated when using fast RNA-seq workflow. This depends
## upon MultiQC and aligned counts generated with STAR.
colData <- getMetricsFromYaml(yaml)
if (hasLength(colData)) {
areDisjointSets(colnames(colData), colnames(sampleData)),
isSubset(rownames(sampleData), rownames(colData))
colData <- colData[rownames(sampleData), , drop = FALSE]
colData <- cbind(colData, sampleData)
} else {
colData <- sampleData
is(colData, "DFrame"),
identical(samples, rownames(colData))
## Assays (counts) -----------------------------------------------------
assays <- SimpleList()
## Use tximport by default for transcript-aware callers. Otherwise,
## resort to loading the featureCounts aligned counts data. As of
## v0.3.22, we're alternatively slotting the aligned counts as "aligned"
## matrix when pseudoaligned counts are defined in the primary "counts"
## assay.
if (isSubset(caller, .tximportCallers)) {
txOut <- identical(level, "transcripts")
if (isTRUE(txOut)) {
tx2gene <- NULL
} else {
tx2gene <- importTxToGene(
file = file.path(projectDir, "tx2gene.csv"),
organism = organism,
genomeBuild = genomeBuild,
release = ensemblRelease
assert(is(tx2gene, "TxToGene"))
txi <- .tximport(
sampleDirs = sampleDirs,
type = caller,
txOut = txOut,
countsFromAbundance = countsFromAbundance,
tx2gene = tx2gene
## Raw counts. Length scaled by default (see `countsFromAbundance`).
## These counts are expected to be non-integer.
assays[["counts"]] <- txi[["counts"]]
## Transcripts per million.
assays[["tpm"]] <- txi[["abundance"]]
## Average transcript lengths.
assays[["avgTxLength"]] <- txi[["length"]]
if (
identical(level, "genes") &&
!isTRUE(fastPipeline) &&
) {
assays[["aligned"]] <- .featureCounts(
projectDir = projectDir,
samples = samples,
genes = rownames(txi[["counts"]])
} else if (isSubset(caller, .featureCountsCallers)) {
countsFromAbundance <- NULL
tx2gene <- NULL
txi <- NULL
assert(identical(level, "genes"))
"Slotting aligned counts into primary {.fun %s} assay.",
assays[["counts"]] <- .featureCounts(
projectDir = projectDir,
samples = samples
identical(names(assays)[[1L]], "counts"),
identical(colnames(assays[[1L]]), rownames(colData))
## Row data (genes/transcripts) ----------------------------------------
h2("Feature metadata")
## Annotation priority:
## 1. AnnotationHub.
## - Requires `organism` to be declared.
## - Ensure that Ensembl release and genome build match.
## 2. GTF/GFF file. Use the bcbio GTF if possible.
## 3. Fall back to slotting empty ranges. This is offered as support for
## complex datasets (e.g. multiple organisms).
if (isString(organism) && is.numeric(ensemblRelease)) {
## AnnotationHub (ensembldb).
rowRanges <- makeGRangesFromEnsembl(
organism = organism,
level = level,
genomeBuild = genomeBuild,
release = ensemblRelease,
ignoreVersion = TRUE
} else {
## GTF/GFF file.
if (is.null(gffFile)) {
## Attempt to use bcbio GTF automatically.
gffFile <- getGtfFileFromYaml(yaml)
if (!is.null(gffFile) && isFALSE(fast)) {
rowRanges <- makeGRangesFromGff(
file = gffFile,
level = level,
ignoreVersion = TRUE
} else {
"Slotting empty ranges into {.fun %s}.",
rowRanges <- emptyRanges(rownames(assays[[1L]]))
assert(is(rowRanges, "GRanges"))
## Attempt to get genome build and Ensembl release if not declared. Note
## that these will remain NULL when using GTF file (see above).
if (is.null(genomeBuild)) {
genomeBuild <- metadata(rowRanges)[["genomeBuild"]]
if (is.null(ensemblRelease)) {
ensemblRelease <- metadata(rowRanges)[["ensemblRelease"]]
## Metadata ------------------------------------------------------------
## Interesting groups.
interestingGroups <- camelCase(interestingGroups, strict = TRUE)
assert(isSubset(interestingGroups, colnames(colData)))
## Organism.
## Attempt to detect automatically if not declared by user.
if (is.null(organism)) {
organism <- tryCatch(
expr = detectOrganism(rownames(assays[[1L]])),
error = function(e) {
fmt = paste(
"Failed to detect organism automatically.",
"Specify with {.arg %s} argument.",
sep = "\n"
metadata <- list(
"allSamples" = allSamples,
"bcbioCommandsLog" = commandsLog,
"bcbioLog" = log,
"call" = standardizeCall(),
"caller" = caller,
"countsFromAbundance" = countsFromAbundance,
"dataVersions" = dataVersions,
"ensemblRelease" = as.integer(ensemblRelease),
"fast" = fast,
"genomeBuild" = as.character(genomeBuild),
"gffFile" = as.character(gffFile),
"interestingGroups" = interestingGroups,
"lanes" = lanes,
"level" = level,
"organism" = as.character(organism),
"packageVersion" = .pkgVersion,
"programVersions" = programVersions,
"projectDir" = projectDir,
"runDate" = runDate(projectDir),
"sampleDirs" = sampleDirs,
"sampleMetadataFile" = as.character(sampleMetadataFile),
"tx2gene" = tx2gene,
"uploadDir" = uploadDir,
"yaml" = yaml
## Make bcbioRNASeq object ---------------------------------------------
rse <- makeSummarizedExperiment(
assays = assays,
rowRanges = rowRanges,
colData = colData,
metadata = metadata,
transgeneNames = transgeneNames
bcb <- new(Class = "bcbioRNASeq", rse)
## DESeq2 --------------------------------------------------------------
if (level == "genes" && isFALSE(fast)) {
dds <- tryCatch(
expr = {
h2(sprintf("{.pkg %s} normalizations", "DESeq2"))
dds <- as(bcb, "DESeqDataSet")
alert(sprintf("{.fun %s}", "estimateSizeFactors"))
dds <- estimateSizeFactors(dds)
if (!.dataHasVariation(dds)) {
fmt = paste(
"Skipping {.pkg %s} calculations.",
"Data set does not have enough variation.",
sep = "\n"
alert(sprintf("{.fun %s}", "DESeq"))
dds <- DESeq(dds)
error = function(e) {
"Skipping {.pkg %s} calculations.",
if (is(dds, "DESeqDataSet")) {
assays(bcb)[["normalized"]] <- counts(dds, normalized = TRUE)
alert(sprintf("{.fun %s}", "varianceStabilizingTransformation"))
vst <- varianceStabilizingTransformation(dds)
assert(is(vst, "DESeqTransform"))
assays(bcb)[["vst"]] <- assay(vst)
## Calculate FPKM. Skip this step if we've slotted empty ranges.
if (length(unique(width(rowRanges(dds)))) > 1L) {
alert(sprintf("{.fun %s}", "fpkm"))
fpkm <- fpkm(dds)
assays(bcb)[["fpkm"]] <- fpkm
} else {
fmt = paste(
"{.fun %s}: Skipping FPKM calculation because",
"{.fun %s} is empty."
"fpkm", "rowRanges"
## Return --------------------------------------------------------------
alertSuccess("bcbio RNA-seq run imported successfully.")
