knitr::opts_chunk$set(tidy=FALSE, cache=TRUE, dev="png", message=FALSE, error=FALSE, warning=TRUE)
NBSplice
is an R package sitting on Bioconductor. You could download the
source file from the Bioconductor site (http://bioconductor.org/) and
then use R CMD install
. Another easy way to install it is by doing:
## try http:// if https:// URLs are not supported if (!require("BiocManager")) install.packages("BiocManager") BiocManager::install("NBSplice", version="devel")
Here we show the most basic steps for a differential splicing analysis
pipeline using the NBSplice
R package.
To start the analysis, an expression matrix at the isoform/transcript
level called isoCounts
, previously generated by upstream processing
tools like Kallisto [@bray2016near] or RSEM [@li2011rsem], is required.
You also need a table describing isoform-gene relationships called geneIso
and a table of experiment information called designMatrix
. In particular,
this table must contain a column storing the experiment factor which
will be evaluated. Here, we assume that this column is called condition
.
As input, the NBSplice
package expects isoform expression counts obtained
from RNA-seq in the form of a matrix of numerical values. In this matrix, the
count value found in the i-th row and the j-th column represents the
expression counts assigned to isoform i in experimental sample j. The
values in the matrix should be un-normalized. We suggest the use of the
effective counts typically reported by the mentioned quantification tools. It
is not necessary to round those because NBSplice
will do it internally.
The expression matrix could be easily obtained by the combination of
quantification files for all samples. For instance, if Kallisto was used to
transcript quantification, the expression matrix should be built combining the
est_counts
columns of those files as is shown in the code below.
fileNames<-c(paste(rep("C1R",4), 1:4,"/abundance.tsv", sep=""), paste(rep("C2R",4), 1:4,"/abundance.tsv", sep="")) myInfo<-lapply(seq_along(fileNames), function(x){ quant<-read.delim(fileNames[x], header = TRUE) expression<-quant[,"est_counts"] return(expression) }) # Adding the isoform IDS as row names isoform_id<-read.delim(fileNames[x], header = TRUE)[,"target_id"] expressionMatrix<-(do.call(cbind, myInfo)) rownames(expressionMatrix)<-isoform_id # Adding the samples names as column names colnames(expressionMatrix)<-c(paste(rep("C1R",4), 1:4, sep=""), paste(rep("C2R",4), 1:4, sep=""))
If another quantification tool was used, you only need to change the name
of the quantification files and the name of the columns having estimated counts
and isoforms IDs. For instance, for RSEM quantification, isoforms counts and
IDS are in the expected_count
and transcript_id
columns of the
Sample.isoforms.results
file, respectively.
As an example, we have included
an isoCounts
matrix which is a subset of one of the simulated matrix used to
demonstrate NBSplice
utility.
data(isoCounts, package="NBSplice") head(isoCounts) dim(isoCounts)
As can be noted, the isoCounts
matrix represents an experiment with
8
samples were 3122
isoforms were quantified.
The geneIso
matrix must have two columns called isoform_id
and gene_id
and as many rows as the isoCounts
matrix. The i-th row of the geneIso
matrix specifies from which gene was originated the i-th isoform. This
information will be used by NBSplice
methods for model fitting and hypothesis
tests.
This matrix could be easily obtained from the annotation file used for isoform
quantification. For instance, if RSEM was used, the matrix can be built taking
the first two columns (transcript_id
and gene_id
) of the
Sample.isoforms.results
file. In the case of Kallisto, this information is
not stored in the quantification file. You could use the biomaRt
R package
to obtain gene IDs for the quantified isoforms or the refGenome
package
to extract those from the annotation file.
As an example, we provided a geneIso
matrix associated with the
isoCounts
object previously loaded.
data(geneIso, package="NBSplice") head(geneIso) dim(geneIso)
This matrix contains annotation information, so, you need to build it everytime that you change your reference transcriptome, downloading the respective information from the database site.
The designMatrix
table must contain experiment information relevant for the
differential splicing analysis. Its number of columns is variable but it should
have as rows as columns the isoCounts
matrix has. Thus, in the designMatrix
the i-th row specifies experimental information associated to the i-th
sample. It is worth to mention that the order and names of isoCounts
columns
must be the same of the designMatrix
rows. Currently, NBSplice
allows
two-levels comparison of single-factor experiments (multiple-factor experiment
comparisons are in development). In order to do this, it is necessary to
specify the argument colName
which indicates the name of the designMatrix
column corresponding to that factor. We provided the
designMatrix
object associated to our isoCounts
matrix which could be
loaded doing data(designMatrix, package="NBSplice")
. Alternatively, this
matrix could be created using the code below.
designMatrix<-data.frame(sample=c(paste(rep("C1R", 4), 1:4, sep=""), paste(rep("C2R", 4), 1:4, sep="")), condition=factor(c( rep("Normal", 4), rep("Tumor", 4)), levels=c("Normal", "Tumor"))) rownames(designMatrix)<-designMatrix[,"sample"] designMatrix
In our case, we must define our colName
argument indicating that our factor
of interest is condition.
colName<-"condition" levels(designMatrix[,colName])
As can be noted, our experiment involved two experimental conditions: Normal
and Tumor. Note that this column of the designMatrix
is a factor
with two
levels where the first level, commonly taken as the reference, is Normal.
NBSplice
incorporates a new class called IsoDataSet
to store expression
counts and experimental data. An object of this class has the next slots:
counts
, geneCounts
, colData
, isoGeneRel
, design
and lowExpIndex
.
The first slot will contain the isoCounts
matrix, whereas, the second one
store the expression counts at the gene level. To do this, NBSplice
incorporates a function which takes every gene of geneIso
and, for each
sample, compute its total expression as the sum of the counts of all its
isoforms contained in isoCounts
. The third and fifth slots correspond to
designMatrix
and geneIso
objects. The slot called design
save the formula
used to model fitting and hypothesis tests, and is internally defined by
NBSplice
based on the colName
argument. The last argument, lowExpIndex
,
stores the indexes of low-expressed isoforms. The way to define those isoforms
is explained in [Identifying low-expressed isoforms].
The typical way to build an IsoDataSet
object is using its constructor which
have the same name as the class. In addition to the four arguments previously
defined, it is possible to define the BPPARAM
argument which must be a
BiocParallelParam
instance defining the parallel back-end to be used
[@biocparallel].
library(NBSplice) myIsoDataSet<-IsoDataSet(isoCounts, designMatrix, colName, geneIso)
NBSplice
also provides several methods for easy object and slots inspections.
show(myIsoDataSet) head(isoCounts(myIsoDataSet))
Note that the isoCounts
method returns the counts
slot but it seems to be
different to the IsoCounts
object previously loaded. This is because this
method has an additional logical parameter, CPM
with default value TRUE
,
useful to indicate if expression counts should be returned in
counts per million (CPM) scale.
Low expression transcripts have been widely recognized as a source of noise in
model fitting and parameter estimation of RNA-seq expression counts. Typically,
count thresholds are used to decide if a gene or an isoform has low
expression. However, NBSplice
assumes that an isoform has a low expression if
their absolute or relative expression is lower than specifics thresholds. This
assumption is due to the package model infers differential splicing by means of
changes in relative expression of isoforms. Consequently, a transcript which
has a very low relative expression in both conditions could exhibit a small
change but enough to be erroneously detected as significant. The
buildLowExpIdx
method is responsible for the identification of low-expressed
isoforms. It takes an IsoDataSet
object and two expression thresholds
and creates a low-expression index and stores this index in the lowExpIndex
object slot. The two thresholds are called ratioThres
and countsThres
. The
first one of those is used to set a minimum of relative expression that each
isoform should achieve in all experimental samples. The second one indicates a
threshold in CPM for mean expression across conditions. The method also needs
the specification of the colName
argument, in order to compute mean
expressions per factor level, and admits parallelization using the BPPARAM
argument.
myIsoDataSet<-buildLowExpIdx(myIsoDataSet, colName, ratioThres = 0.01, countThres = 1)
Once an IsoDataSet
was built and its low-expressed isoforms were identified,
you could use the NBSplice
for differential splicing evaluation using the
NBTest
method. It provides a user-friendly interface between model fitting
and hypothesis testing and the user by means a single code line. The theory
behind the NBSplice
model is explained in [NBSplice strategy]. Briefly,
NBTest
iterates along genes obtaining significance p-values at gene and
isoform level indicating the occurrence of differential splicing and
significant changes in isoform proportion, respectively. Then, isoform and gene
p-values are adjusted. Low-expressed isoforms are ignored for model fitting but
their mean relative expression per condition are also computed. The method
requires the specification of the colName
, as well as the statistic assumed
for hypothesis test p-values distribution. As the colName
factor could have
more than two levels, it is necessary the specification of an additional
argument defining the two levels to be contrasted. By default, the contrast
argument takes the two first levels of colName
. NBTest
returns an
NBSpliceRes
object which stores the obtained results.
myDSResults<-NBTest(myIsoDataSet, colName, test="F")
data(myDSResults, package="NBSplice")
NBSpliceRes
is another new class provided by NBSplice to facilitate the
manipulation and exploration of the differential splicing results obtained
using the NBTest
method. An object of this class has three slots called
results
, lowExpIndex
and contrast
. The first slot stores differential
splicing results for all isoforms, whereas the second slot keeps the indexes of
low-expressed isoforms and the third one contains the evaluated contrast.
Specific methods for slot accession are also provided by NBSplice
head(lowExpIndex(myDSResults)) contrast(myDSResults) head(results(myDSResults))
The data.frame
containing differential splicing results has the next columns:
iso
: Isoform name.gene
: Gene name.ratio_X
: Relative expression of the isoform in experimental condition
X, which is the first value specified with the contrast
argument.ratio_Y
: Relative expression of the isoform in experimental condition
Y, which is the second value specified with the contrast
argument.odd
: Ratio between isoform's relative expression in condition X and
Y.stat
: Statistic of the hypothesis test performed to decide if the change
in isoform's relative expression between X and Y conditions was
significant.pval
: P-value at the isoform level corresponding to the statistic stat
.genePval
: P-value at the gene level.FDR
: Adjusted p-value at the isoform level.genePval
: Adjusted p-value at the gene level.In particular, the pval
or its adjusted value, FDR
, is useful for deciding
if an isoform experimented significant change in its relative expression due to
the change between X and Y conditions, here called Normal and Tumor.
Whereas, the genePval
or its adjusted value, geneFDR
, will help you to
decide if a gene evidenced significant changes in its alternative splicing
pattern when conditions X and Y were contrasted.
In the example illustrated above, non-reliable isoforms were
discarded by means of the filter
argument of the results
methods which has
TRUE
as its default value. Non-reliable refers to both low expressed isoforms
and those cases where model estimations did not achieve convergence. To obtain
the full results
matrix only you need specifies that filter
argument is
FALSE
.
head(results(myDSResults, filter = FALSE))
Non-reliable isoforms could be easily detected in the obtained matrix because
they present NA
value in all parameters derived from model fitting and
hypothesis test. It is worth to note that although non-reliable isoforms were
not analyzed by statistical models, they have been used to compute the total
gene counts so in order to avoid overestimation of relative expression of the
well-expressed isoforms.
The NBSpliceRes
class has several methods useful for differential splicing
results exploration.
NBSplice
assumes a gene is differentially spliced if its p-value (raw or
adjusted) is lower than a significant threshold (default value = 0.05). The
results table only for those differentially spliced genes is easily extracted
using the GetDSResults
method. Whereas, the names of those genes is obtained
by means of the GetDSGenes
. Both methods expect at least one argument of
class NBSpliceRes
. In addition, they could receive a logical parameter,
adjusted
, indicating if p-values must be adjusted (default) or not, and a
numeric parameter, p.value
, establishing the significance threshold (default
value = 0.05).
mySignificantRes<-GetDSResults(myDSResults) head(mySignificantRes) dim(mySignificantRes) myDSGenes<-GetDSGenes(myDSResults) head(myDSGenes) length(myDSGenes)
As can be seen, in our example, NBSplice
detected 184 isoforms with a
significant change of relative expression between Normal and Tumor
condition whereas 40 genes were identified as differentially spliced.
The scatter plot of isoforms' relative expression between the two contrasted
conditions is explored using the plotRatiosDisp
. It returns a ggplot2
object
which could be then easily modified and customized.
plotRatiosDisp(myDSResults)
As can be noted, isoforms are colored according to if they came from those
genes detected as differentially spliced or not. To do this, the
adjusted
and p.value
parameters, previously used to call the
getDSResults
method, could be specified in the plotRatiosDisp
calling.
The volcano plot is commonly used to explore the dispersion between statistical
significance from a statistical test with the magnitude of the change. In the
case of NBSplice
the isoform p-value and the difference between isoform's
relative expression per contrasted condition are explored using the
plotVolcano
method. In addition, isoforms are colored according to if they
came from those genes detected as differentially spliced or not.
plotVolcano(myDSResults)
NBSplice
also provides two methods to individually explore differentially
spliced genes: GetGeneResults
and plotGeneResults
. The first one is useful
for obtaining the differential expression results of a single gene, specified
using the gene
argument. The second one is a graphical method to generate a
bar-plot illustrating the isoforms' relative expression values in the two
contrasted conditions. In both cases, it is possible filtering those
non-reliable isoforms by means of the filterLowExpIso
argument. In the case
of the plotGeneResults
method, non-significant isoforms could be also
filtered out setting the filterNotSignificant
argument to TRUE
which is its
default value. As an example, the results obtained for the first analyzed gene
are shown below.
## Select the first differentially spliced gene gene<-GetDSGenes(myDSResults)[1] GetGeneResults(myDSResults, gene) plotGeneResults(myDSResults, gene)
## Keeping non-reliable and non-significant isoforms plotGeneResults(myDSResults, gene, filterLowExpIso = FALSE, filterNotSignificant = FALSE)
NBSplice
, is a package developed for differential splicing detection starting
from isoform quantification data from RNA-seq experiments. Our tool is based on
generalized linear models (GLMs) and uses them to infer significant differences
in isoforms relative expression values between two experimental conditions. In
addition, NBSplice
not only reveals what genes were under DS but also reveals
the identity of the gene isoforms affected by it as well as their relative
expression in each experimental condition.
NBSplice
is based on the use of GLMs to model isoform expression counts. This
approach has been widely used for several differential expression tools like
edgeR
[@robinson2010edger] and DESeq2
[@love2014moderated] for differential
gene expression and DEXSeq
[@anders2012detecting] for differential exon usage
analysis. These tools are based in negative binomial (NB) models and, in
general, they assume that the counts of the i-th gene/exon in the k-th
sequenced sample, denoted as $y_{ik}$, are a random variable with a NB
distribution.
\begin{equation} y_{ik}\sim NB(\mu_{ik}=p_{ik}s_k, \phi_{i}) \end{equation}
This NB distribution is characterized by two parameters: the mean, $\mu_{ik}$, and the dispersion, $\phi_{i}$. Specifically, previously mentioned R packages assume that $\mu_{ik}$ can be expressed as the product of the real proportion of mRNA fragments of the i-th gene in the k-th sample, $p_{ik}$, with a factor, $s_k$, related to the sequencing library size. The parameter $\phi_{i}$ is directly related to the variance of $y_{ik}$
\begin{equation} V(y_{ik})=\mu_{ik}+\phi_i\mu_{ik} \label{EQ:var} \end{equation}
Commonly used methods fit GLMs with a logarithmic link and with the elements $x_{kr}$ of the design matrix to infer differential expression
\begin{equation} log(p_{ik})=\sum\limits_rx_{kr}\beta_{ir} \label{EQ:mod} \end{equation}
In the most simple case, a treatment-control experiment, $r$ be equal to $1$ and the design matrix only will have one column where each cell indicates if sample $k$ was or not treated.
NBSplice
takes the advantage of the previously described strategy although
it proposes an alternative approach. It assumes that isoform expression counts,
in CPMs, of the i-th isoform from the j-th gene in the k-th sample, $y_{ijk}$,
follow a NB distribution as
\begin{equation} y_{ijk}\sim NB(media=p_{ijk}\mu_{jk}, dispersion=\phi_{j}) \label{EQ:eq5} \end{equation}
where $p_{ijk}$ represents the proportion of the mRNA fragments from the j-th gene ($\mu_{jk}$) belonging to the i-th isoform and the $\phi_{j}$ is the dispersion of the j-th gene, in the k-th sample. In this context, for each isoform, it is possible to predict its mean relative expression or proportion by means of a GLM fit at the level gene according to
\begin{equation} log(p_{ijk})=x_{r.}\beta_{ij} \label{EQ:eq6} \end{equation}
where $\beta_{ij}$ represents the change in $p_{ijk}$ due to the effect
described by the r-th column of the design matrix X ($x_{r.}$). Thus,
NBSplice
proposes, unlike other tools such as DEXSeq
or
Limma
[@ritchie2015limma], the adjustment of one model for each gene. In
particular, this model must consider the different gene isoforms as effects. In
the most simple case, where only one factor (cond) with two levels
($l=1, ~2$) is analyzed , the previous equation is summarized at:
\begin{equation} ln(p_{ijl})=\mu_0+iso_{ij}+cond_l+ iso_{ij}:cond_l \label{EQ:eq7} \end{equation}
Once coefficient models are estimated, $\hat{\beta}$, NBSplice
evaluates if
the changes in isoforms proportions due to experimental conditions are
significant by means of a linear hypothesis test (Wald test) based on the
F-statistic defined below [@greene2017econometric].
\begin{equation} H_0:cond_l+ iso_{ij}:cond_l=0 \ F=\frac{(H\hat{\beta}-q)'(H \hat{cov}(\hat{\beta})H')^{-1}(H\hat{\beta}-q)}{p} \end{equation}
Here, $H$ is the contrast matrix, $q$ is a single element matrix having the value of the right side of the test equation ($q=0$), $\hat{cov}(\hat{\beta})$ is the covariance matrix of $\hat{\beta}$ and $p$ is equal to the number of rows of $H$ ($p=1$). In particular, $H$ is a full rank matrix, conformable with $\hat{\beta}$ so, when they are multiplied, the linear combination of the left side of the test equation is defined. Thus, under the null hypothesis, the F statistics will have $F$ distribution with $1$ and $n-K$ degrees of freedom. Specifically, $n$ is the number of samples and $K$ is the length of $\beta$. When only one experimental factor with two levels is analyzed (treatment-control case), $K$ will be equal to the double of the number of analyzed isoforms of gene $j$, $I_j$. The comparison of the F statistics against its distribution under null hypothesis provides a p-value for each isoform of gene $j$.
Given the fact that DS affect the whole gene, once isoform hypothesis tests
were performed, NBSplice
uses the Simes test [@simes1986improved] to combine
isoform p-values obtaining a gene p-value. Given a family of null hypothesis,
$H_{01}, ..., H_{0I_j}$, to test changes in relative isoform expression of the
$I_j$ isoforms from gene $j$, the Simes test evaluates only one global null
hypothesis
\begin{equation} H_0=\bigcap\limits_{p=1}^{I_j}H_{0i} \end{equation}
To do this, the isoform p-values are sorted $p-value_{(1)j}< \dots < p-value_{(m)j}< \dots <p-value{(I)j}$ and then the Simes statistic, $T_j$, is defined as
\begin{equation} T_j=\min { \frac{I_j p-value{(m)j}}{m}} \label{EQ:eq10} \end{equation}
Under the null hypothesis, this statistic has uniform distribution, $U(0,1)$, representing a p-value. The Simes test rejects $H_0$ if $T_j$ is lower or equal to the significance threshold $\alpha$ [@simes1986improved]. Finally, isoform and gene p-values are corrected using the FDR strategy [@benjamini1995controlling].
NBSplice
uses commonly used methods and R packages for classical GLM fit
[@mccullagh1984generalized]. In particular, the glm.nb
function of the MASS
package [@MASS] is used to model fitting. It was chosen because this function
allows the estimation of the dispersion parameter reporting a logical parameter
indicating if the model estimation converged or not. This parameter is used by
NBSplice
because if estimations do not converge, thus inferences over the
corresponding genes won't be done. In addition, the lht
function of the car
package [@car] was used for linear hypothesis tests and the simes.test
method
from mppa
package [@mppa], for the Simes test.
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.