library(knitr)
satuRn
is an R package for performing differential transcript usage analyses
in bulk and single-cell transcriptomics datasets. The package has three main
functions.
The first function, fitDTU
, is used to model transcript usage profiles by
means of a quasi-binomial generalized linear model.
Second, the testDTU
function tests for differential usage of transcripts
between certain groups of interest (e.g. different treatment groups or
cell types).
Finally, the plotDTU
can be used to visualize the usage profiles of
selected transcripts in different groups of interest.
All details about the satuRn
model and statistical tests are described
in our publication [@Gilis2021].
In this vignette, we analyze a small subset of the data from [@Tasic2018].
More specifically, an expression matrix and the corresponding metadata of
the subset data has been provided with the satuRn
package. We will adopt
this dataset to showcase the different functionalities of satuRn
.
satuRn
can be installed from Bioconductor with:
if(!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("satuRn")
Alternatively, the development version of satuRn
can be downloaded with:
devtools::install_github("statOmics/satuRn")
library(satuRn) library(AnnotationHub) library(ensembldb) library(edgeR) library(SummarizedExperiment) library(ggplot2) library(DEXSeq) library(stageR)
The following data corresponds to a small subset of the dataset
from [@Tasic2018] and is readily available from the satuRn
package.
For more details on how the subset was generated, please check
?Tasic_counts_vignette
.
data(Tasic_counts_vignette) # transcript expression matrix data(Tasic_metadata_vignette) # metadata
We start the analysis from scratch, in order to additionally showcase some of the prerequisite steps for performing a DTU analysis.
First, we need an object that links transcripts to their corresponding genes.
We suggest using the BioConductor R packages AnnotationHub
and ensembldb
for this purpose.
ah <- AnnotationHub() # load the annotation resource. all <- query(ah, "EnsDb") # query for all available EnsDb databases ahEdb <- all[["AH75036"]] # for Mus musculus (choose correct release date) txs <- transcripts(ahEdb)
Next, we perform some data wrangling steps to get the data in a format that is
suited for satuRn
. First, we create a DataFrame
or Matrix
linking
transcripts to their corresponding genes.
! Important: satuRn
is implemented such that the columns with transcript
identifiers is names isoform_id
, while the column containing gene identifiers
should be named gene_id
. In addition, following chunk removes transcripts
that are the only isoform expressed of a certain gene, as they cannot be used
in a DTU analysis.
# Get the transcript information in correct format txInfo <- as.data.frame(matrix(data = NA, nrow = length(txs), ncol = 2)) colnames(txInfo) <- c("isoform_id", "gene_id") txInfo$isoform_id <- txs$tx_id txInfo$gene_id <- txs$gene_id rownames(txInfo) <- txInfo$isoform_id # remove transcript version identifiers rownames(Tasic_counts_vignette) <- sub("\\..*", "", rownames(Tasic_counts_vignette)) # Remove transcripts that are the only isoform expressed of a certain gene txInfo <- txInfo[txInfo$isoform_id %in% rownames(Tasic_counts_vignette), ] txInfo <- subset(txInfo, duplicated(gene_id) | duplicated(gene_id, fromLast = TRUE)) Tasic_counts_vignette <- Tasic_counts_vignette[which( rownames(Tasic_counts_vignette) %in% txInfo$isoform_id), ]
Here we perform some feature-level filtering. For this task, we adopt
the filtering criterion that is implemented in the R package edgeR
.
Alternatively, one could adopt the dmFilter
criterion from the DRIMSeq
R
package, which provides a more stringent filtering when both methods are run
in default settings. After filtering, we again remove transcripts that are
the only isoform expressed of a certain gene.
filter_edgeR <- filterByExpr(Tasic_counts_vignette, design = NULL, group = Tasic_metadata_vignette$brain_region, lib.size = NULL, min.count = 10, min.total.count = 30, large.n = 20, min.prop = 0.7 ) # more stringent than default to reduce run time of the vignette table(filter_edgeR) Tasic_counts_vignette <- Tasic_counts_vignette[filter_edgeR, ] # Update txInfo according to the filtering procedure txInfo <- txInfo[which( txInfo$isoform_id %in% rownames(Tasic_counts_vignette)), ] # remove txs that are the only isoform expressed within a gene (after filtering) txInfo <- subset(txInfo, duplicated(gene_id) | duplicated(gene_id, fromLast = TRUE)) Tasic_counts_vignette <- Tasic_counts_vignette[which(rownames( Tasic_counts_vignette) %in% txInfo$isoform_id), ] # satuRn requires the transcripts in the rowData and # the transcripts in the count matrix to be in the same order. txInfo <- txInfo[match(rownames(Tasic_counts_vignette), txInfo$isoform_id), ]
Here we set up the design matrix of the experiment. The subset of the dataset
from [@Tasic2018] contains cells of several different cell types
(variable cluster
) in two different areas of the mouse neocortex
(variable brain_region
). As such, we can model the data with
a factorial design, i.e. by generating a new variable group
that encompasses
all different cell type - brain region combinations.
Tasic_metadata_vignette$group <- paste(Tasic_metadata_vignette$brain_region, Tasic_metadata_vignette$cluster, sep = ".")
All three main functions of satuRn
require a SummarizedExperiment
object as
an input class, or one of its extensions (RangedSummarizedExperiment
,
SingleCellExperiment
). See the SummarizedExperiment vignette
[@SummarizedExperiment] for more information on this object class.
For the sake of completeness, it is advised to include the design matrix formula in the SummarizedExperiment as indicated below.
sumExp <- SummarizedExperiment::SummarizedExperiment( assays = list(counts = Tasic_counts_vignette), colData = Tasic_metadata_vignette, rowData = txInfo ) # Alternatively, use a SingleCellExperiment as input object # sumExp <- SingleCellExperiment::SingleCellExperiment( # assays = list(counts = Tasic_counts_vignette), # colData = Tasic_metadata_vignette, # rowData = txInfo # ) # for sake of completeness: specify design formula from colData metadata(sumExp)$formula <- ~ 0 + as.factor(colData(sumExp)$group) sumExp
The fitDTU
function of satuRn
is used to model transcript usage in different
groups of samples or cells. Here we adopt the default settings of the function.
Without parallelized execution, this code runs for approximately 15 seconds
on a 2018 macbook pro laptop.
system.time({ sumExp <- satuRn::fitDTU( object = sumExp, formula = ~ 0 + group, parallel = FALSE, BPPARAM = BiocParallel::bpparam(), verbose = TRUE ) })
The resulting model fits are now saved into the rowData
of
our SummarizedExperiment object under the name fitDTUModels
.
These models can be accessed as follows:
rowData(sumExp)[["fitDTUModels"]]$"ENSMUST00000037739"
The models are instances of the StatModel
class as defined in
the satuRn
package. These contain all relevant information for
the downstream analysis. For more details, read the StatModel documentation
with ?satuRn::StatModel-class
.
Here we test for differential transcript usage between select groups of interest. In this example, the groups of interest are the three different cell types that are present in the dataset associated with this vignette.
First, we set up a contrast matrix. This allows us to test for differential
transcript usage between groups of interest. The group
factor in this
toy example contains three levels; (1) ALM.L5_IT_ALM_Tmem163_Dmrtb1,
(2) ALM.L5_IT_ALM_Tnc, (3) VISp.L5_IT_VISp_Hsd11b1_Endou.
Here we show to assess DTU between cells of the groups 1 and 3
and between cells of groups 2 and 3.
The contrast matrix can be constructed manually;
group <- as.factor(Tasic_metadata_vignette$group) design <- model.matrix(~ 0 + group) # construct design matrix colnames(design) <- levels(group) L <- matrix(0, ncol = 2, nrow = ncol(design)) # initialize contrast matrix rownames(L) <- colnames(design) colnames(L) <- c("Contrast1", "Contrast2") L[c("VISp.L5_IT_VISp_Hsd11b1_Endou","ALM.L5_IT_ALM_Tnc"),1] <-c(1,-1) L[c("VISp.L5_IT_VISp_Hsd11b1_Endou","ALM.L5_IT_ALM_Tmem163_Dmrtb1"),2] <-c(1,-1) L # contrast matrix
This can also be done automatically with the makeContrasts
function of
the limma
R package.
group <- as.factor(Tasic_metadata_vignette$group) design <- model.matrix(~ 0 + group) # construct design matrix colnames(design) <- levels(group) L <- limma::makeContrasts( Contrast1 = VISp.L5_IT_VISp_Hsd11b1_Endou - ALM.L5_IT_ALM_Tnc, Contrast2 = VISp.L5_IT_VISp_Hsd11b1_Endou - ALM.L5_IT_ALM_Tmem163_Dmrtb1, levels = design ) L # contrast matrix
Next we can perform differential usage testing using testDTU
. We again adopt
default settings. For more information on the parameter settings, please consult
the help file of the testDTU
function.
sumExp <- satuRn::testDTU( object = sumExp, contrasts = L, diagplot1 = TRUE, diagplot2 = TRUE, sort = FALSE, forceEmpirical = FALSE )
When set to TRUE, the diagplot1
and diagplot2
arguments generate a
diagnostic plot.
For diagplot1
, the histogram of the z-scores (computed from p-values) is
displayed using the locfdr function of the locfdr
package. The blue dashed
curve is fitted to the mid 50% of the z-scores, which are assumed to originate
from null transcripts, thus representing the estimated empirical null component
densities. The maximum likelihood estimates (MLE) and central matching estimates
(CME) of this estimated empirical null distribution are given below the plot.
If the MLE estimates for delta and sigma deviate from 0 and 1 respectively, the
downstream inference will be influenced by the empirical adjustment implemented
in satuRn (see below).
For diagplot2
, a plot of the histogram of the "empirically adjusted" test
statistics and the standard normal distribution will be displayed. Ideally,
the majority (mid portion) of the adjusted test statistics should follow
the standard normal. If this is not the case, the inference may be
untrustworthy and results should be treated with care. One potential solution
is to include (additional) potential covariates in the analysis.
The test results are now saved into the rowData
of our SummarizedExperiment
object under the name fitDTUResult_
followed by the name of the contrast
of interest (i.e. the column names of the contrast matrix).
The results can be accessed as follows:
head(rowData(sumExp)[["fitDTUResult_Contrast1"]]) # first contrast
head(rowData(sumExp)[["fitDTUResult_Contrast2"]]) # second contrast
The results will be, for each contrast, a dataframe with 8 columns:
estimates
: The estimated log-odds ratios (log base e). In the most simple
case, an estimate of +1 would mean that the odds of picking that transcript from
the pool of transcripts within its corresponding gene is exp(1) = 2.72 times
larger in condition B than in condition A.se
: The standard error on this estimate.df
: The posterior degrees of freedom for the test statistic.t
: The student's t-test statistic, computed with a Wald test given
estimates
and se
.pval
: The "raw" p-value given t
and df
.FDR
: The false discovery rate, computed using the multiple testing
correction of Benjamini and Hochberg on pval
.empirical_pval
: An "empirical" p-value that is computed by estimating the
null distribution of the test statistic empirically. For more details, see our
publication.empirical_FDR
: The false discovery rate, computed using the multiple testing
correction of Benjamini and Hochberg on pval_empirical
.!Note: based on the benchmarks in our publication, we always recommend using
the empirical p-values from column 7 over the "raw" p-value from column 5.
When the MLE estimates for the mean and standard deviation (delta and sigma) of
the empirical null density (blue dashed curve in diagplot1
) deviate from 0 and
1 respectively, there will be a discrepancy between the "raw" and "empirically
adjusted" p-values. A deviation in the standard deviation only affects the
magnitude of the p-values, whereas a deviation in the mean also affects the
ranking of transcripts according to their p-value.
Finally, we may visualize the usage of the top 3 differentially used transcripts
in selected treatment groups. By the setting the transcripts
and
genes
arguments to NULL
and specifying top.n = 3
, the 3 features with the
smallest (empirically correct) false discovery rates are displayed.
Alternatively, visualizing transcripts of interest or all transcripts within a
gene of interest is possible by specifying the transcripts
or genes
arguments, respectively.
group1 <- colnames(sumExp)[colData(sumExp)$group == "VISp.L5_IT_VISp_Hsd11b1_Endou"] group2 <- colnames(sumExp)[colData(sumExp)$group == "ALM.L5_IT_ALM_Tnc"] plots <- satuRn::plotDTU( object = sumExp, contrast = "Contrast1", groups = list(group1, group2), coefficients = list(c(0, 0, 1), c(0, 1, 0)), summaryStat = "model", transcripts = NULL, genes = NULL, top.n = 3 ) # to have same layout as in our paper for (i in seq_along(plots)) { current_plot <- plots[[i]] + scale_fill_manual(labels = c("VISp", "ALM"), values = c("royalblue4", "firebrick")) + scale_x_discrete(labels = c("Hsd11b1_Endou", "Tnc")) print(current_plot) }
satuRn
returns transcript-level p-values for each of the specified contrasts.
While we have shown that satuRn
is able to adequately control the false
discovery rate (FDR) at the transcript level [@Gilis2021],
[@VandenBerge2017] argued that it is often desirable to control the FDR at
the gene level. This boosts statistical power and eases downstream
biological interpretation and validation, which typically occur at
the gene level.
To this end, [@VandenBerge2017] developed a testing procedure that is
implemented in the BioConductor R package stageR
. The procedure consists of
two stages; a screening stage and a confirmation stage.
In the screening stage, gene-level FDR-adjusted p-values are computed, which aggregate the evidence for differential transcript usage over all transcripts within the gene. Only genes with an FDR below the desired nominal level are further considered in the second stage. In the confirmation stage, transcript-level p-values are adjusted for those genes, using a FWER-controlling method on the FDR-adjusted significance level.
In its current implementation, stageR
can only perform stage-wise testing if
only one contrast is of interest in a DTU setting. An analogous correction for
the assessment of multiple contrasts for multiple transcripts per gene
has not yet been implemented.
Below, we demonstrate how the transcript-level p-values for the first contrast
as returned by satuRn
can be post-processed using stageR
.
We rely on the perGeneQValue
function:
# transcript level p-values from satuRn pvals <- rowData(sumExp)[["fitDTUResult_Contrast1"]]$empirical_pval # compute gene level q-values geneID <- factor(rowData(sumExp)$gene_id) geneSplit <- split(seq(along = geneID), geneID) pGene <- sapply(geneSplit, function(i) min(pvals[i])) pGene[is.na(pGene)] <- 1 theta <- unique(sort(pGene)) # gene-level significance testing q <- DEXSeq:::perGeneQValueExact(pGene, theta, geneSplit) qScreen <- rep(NA_real_, length(pGene)) qScreen <- q[match(pGene, theta)] qScreen <- pmin(1, qScreen) names(qScreen) <- names(geneSplit) # prepare stageR input tx2gene <- as.data.frame(rowData(sumExp)[c("isoform_id", "gene_id")]) colnames(tx2gene) <- c("transcript", "gene") pConfirmation <- matrix(matrix(pvals), ncol = 1, dimnames = list(rownames(tx2gene), "transcript") ) # create a stageRTx object stageRObj <- stageR::stageRTx( pScreen = qScreen, pConfirmation = pConfirmation, pScreenAdjusted = TRUE, tx2gene = tx2gene ) # perform the two-stage testing procedure stageRObj <- stageR::stageWiseAdjustment( object = stageRObj, method = "dtu", alpha = 0.05, allowNA = TRUE ) # retrieves the adjusted p-values from the stageRTx object padj <- stageR::getAdjustedPValues(stageRObj, order = TRUE, onlySignificantGenes = FALSE ) head(padj)
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.