library(knitr)
satuRn
is an R package to perform differential transcript usage analyses in bulk and single-cell transcriptomics datasets. The package has three main functions. The first function, fitDTU
, is used to model transcript usage profiles by means of a quasi-binomial generalized linear model. Second, the testDTU
function tests for differential usage of transcripts between certain groups of interest (e.g. different treatment groups or cell types). Finally, the plotDTU
can be used to visualize the usage profiles of selected transcripts in different groups of interest. All details about the satuRn
model and statistical tests are described in our preprint [...].
In this vignette, we analyze a small subset of the data from [@Tasic2018]. More specifically, an expression matrix and the corresponding metadata of the subset data has been provided with the satuRn
package. We will adopt this dataset to showcase the different functionalities of satuRn
.
For the moment, satuRn
is only available from our GitHub page https://github.com/statOmics/satuRn. We hope to make a submission to the Bioconductor project soon.
#devtools::install_github("statOmics/satuRn") #test when no longer private
library(satuRn) library(AnnotationHub) library(ensembldb) library(edgeR) library(SummarizedExperiment) library(ggplot2)
The following data corresponds to a small subset of the dataset from [@Tasic2018] and is readily available from the satuRn
package. To check how the subset was generate, please check ?Tasic_counts_vignette
.
data(Tasic_counts_vignette) # transcript expression matrix data(Tasic_metadata_vignette) # metadata
We start the analysis from scratch, in order to additionally showcase some of the prerequisite steps for performing a DTU analysis.
First, we need an object that links the transcripts the expression matrix to their corresponding genes. We suggest using AnnotationHub
and ensembldb
for this purpose.
ah <- AnnotationHub() # load the annotation resource. all <- query(ah, "EnsDb") # query for all available EnsDb databases ahEdb <- all[["AH75036"]] # for Mus musculus (choose correct release date) txs <- transcripts(ahEdb)
Next, we perform some data wrangling steps to get the data in a format that is suited for satuRn. First, we create a DataFrame
or Matrix
linking transcripts to their corresponding genes.
! Important: satuRn
is implemented such that the columns with transcript identifiers is names isoform_id
, while the column containing gene identifiers should be named gene_id
. In addition, following chunk removes transcripts that are the only isoform expressed of a certain gene, as they cannot be used in a DTU analysis.
# Get the transcript information in correct format txInfo <- as.data.frame(matrix(data = NA, nrow = length(txs), ncol = 2)) colnames(txInfo) <- c("isoform_id", "gene_id") txInfo$isoform_id <- txs$tx_id txInfo$gene_id <- txs$gene_id rownames(txInfo) <- txInfo$isoform_id # Remove transcripts that are the only isoform expressed of a certain gene rownames(Tasic_counts_vignette) <- sub("\\..*", "", rownames(Tasic_counts_vignette)) txInfo <- txInfo[txInfo$isoform_id %in% rownames(Tasic_counts_vignette), ] txInfo <- subset(txInfo, duplicated(gene_id) | duplicated(gene_id, fromLast = TRUE)) Tasic_counts_vignette <- Tasic_counts_vignette[which(rownames(Tasic_counts_vignette) %in% txInfo$isoform_id), ]
Here we perform some feature-level filtering. For this task, we adopt the filtering criterium that is implemented in the R package edgeR
. Alternatively, one could adopt the dmFilter
criterium from the DRIMSeq
R package, which provides a more stringent filtering when both methods are run in default settings. After filtering, we again remove transcripts that are the only isoform expressed of a certain gene.
filter_edgeR <- filterByExpr(Tasic_counts_vignette, design = NULL, group = Tasic_metadata_vignette$brain_region, lib.size = NULL, min.count = 10, min.total.count = 30, large.n = 20, min.prop = 0.7 ) # more stringent than default to reduce run time of the vignette table(filter_edgeR) Tasic_counts_vignette <- Tasic_counts_vignette[filter_edgeR, ] # Update txInfo according to the filtering procedure txInfo <- txInfo[which(txInfo$isoform_id %in% rownames(Tasic_counts_vignette)), ] # remove transcripts that are the only isoform expressed of a certain gene (after filtering) txInfo <- subset(txInfo, duplicated(gene_id) | duplicated(gene_id, fromLast = TRUE)) Tasic_counts_vignette <- Tasic_counts_vignette[which(rownames(Tasic_counts_vignette) %in% txInfo$isoform_id), ] # satuRn requires the transcripts in the rowData and the transcripts in the count matrix to be in the same order. txInfo <- txInfo[match(rownames(Tasic_counts_vignette), txInfo$isoform_id), ]
Here we set up the design matrix of the experiment. The subset of the dataset from [@Tasic2018] contains cells of several different cell types (variable cluster
) in two different areas of the mouse neocortex (variable brain_region
). As such, we can model the data with a factorial design, i.e. by generating a new variable group
that encompasses all different cell type - brain region combinations.
Tasic_metadata_vignette$group <- paste(Tasic_metadata_vignette$brain_region, Tasic_metadata_vignette$cluster, sep = ".")
All three main functions of satuRn
require a SummarizedExperiment
object as an input class. See the SummarizedExperiment vignette [@SummarizedExperiment] for more information on this object class.
Do not forget to include the design matrix formula (see above) to the SummarizedExperiment as indicated below. As such, the object contains all the information required for the downstream DTU analysis.
sumExp <- SummarizedExperiment::SummarizedExperiment( assays = list(counts = Tasic_counts_vignette), colData = Tasic_metadata_vignette, rowData = txInfo ) metadata(sumExp)$formula <- ~ 0 + as.factor(colData(sumExp)$group) # specify design formula from colData sumExp
The fitDTU
function of satuRn
is used to model transcript usage in different groups of samples or cells. Here we adopt the default settings of the function. Without parallel execution, this code runs for approximately 15 seconds on a 2018 macbook pro laptop.
Sys.time() sumExp <- satuRn::fitDTU( object = sumExp, formula = ~0+group, parallel = FALSE, BPPARAM = BiocParallel::bpparam(), verbose = TRUE ) Sys.time()
The resulting model fits are now saved into the rowData
of our SummarizedExperiment object under the name fitDTUModels
. These models can be accessed as follows:
rowData(sumExp)[["fitDTUModels"]]$"ENSMUST00000037739" rowData(sumExp)[["fitDTUModels"]][1] # equivalent
The models are instances of the StatModel
class as defined in the satuRn
package. These contain all relevant information for the downstream analysis. For more details, read the StatModel documentation with ?satuRn::StatModel-class
.
Here we test for differential transcript usage between select groups of interest.
First, we set up a contrast matrix. This allows us to test for differential transcript usage between groups of interest. The group
factor in this toy example contains three levels; (1) ALM.L5_IT_ALM_Tmem163_Dmrtb1, (2) ALM.L5_IT_ALM_Tnc, (3) VISp.L5_IT_VISp_Hsd11b1_Endou. Here we show to assess DTU between cells of the groups 1 and 3 and between cells of groups 2 and 3.
group <- as.factor(Tasic_metadata_vignette$group) design <- model.matrix(~ 0 + group) colnames(design) <- levels(group) L <- matrix(0, ncol = 2, nrow = ncol(design)) # initialize contrast matrix rownames(L) <- colnames(design) colnames(L) <- c("Contrast1", "Contrast2") L[c("VISp.L5_IT_VISp_Hsd11b1_Endou", "ALM.L5_IT_ALM_Tnc"), 1] <- c(1, -1) L[c("VISp.L5_IT_VISp_Hsd11b1_Endou", "ALM.L5_IT_ALM_Tmem163_Dmrtb1"), 2] <- c(1, -1) L # final contrast matrix
Next we can perform differential usage testing using testDTU
. We again adopt default settings. For more information on the parameter settings, please fitting the help file of the testDTU function.
sumExp <- satuRn::testDTU(object = sumExp, contrasts = L, plot = FALSE, sort = FALSE)
The test results are now saved into the rowData
of our SummarizedExperiment object under the name fitDTUResult_
followed by the name of the contrast of interest (i.e. the column names of the contrast matrix). The results can be accessed as follows:
head(rowData(sumExp)[["fitDTUResult_Contrast1"]]) # first contrast
head(rowData(sumExp)[["fitDTUResult_Contrast2"]]) # second contrast
Finally, we may visualize the usage of select transcripts in select groups of interest.
group1 <- rownames(colData(sumExp))[colData(sumExp)$group == "VISp.L5_IT_VISp_Hsd11b1_Endou"] group2 <- rownames(colData(sumExp))[colData(sumExp)$group == "ALM.L5_IT_ALM_Tnc"] plots <- satuRn::plotDTU(object = sumExp, contrast = "Contrast1", groups = list(group1, group2), coefficients = list(c(0, 0, 1), c(0, 1, 0)), summaryStat = "model", transcripts = c("ENSMUST00000081554", "ENSMUST00000195963", "ENSMUST00000132062"), genes = NULL, top.n = 6) # to have same layout as in our paper for (i in seq_along(plots)) { current_plot <- plots[[i]] + scale_fill_manual(labels = c("VISp", "ALM"), values = c("royalblue4", "firebrick")) + scale_x_discrete(labels = c("Hsd11b1_Endou", "Tnc")) print(current_plot) }
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.