Here we show a brief example of using the alpine package to model
bias parameters and then using those parameters to estimate transcript
abundance. We load a metadata table and a subset of reads from four
samples from the GEUVADIS project. For more details on these files,
see ?alpineData
in the alpineData package.
library(knitr)
opts_chunk$set(cache=FALSE, error=FALSE)
library(alpineData)
dir <- system.file("extdata", package="alpineData")
metadata <- read.csv(file.path(dir, "metadata.csv"),
                     stringsAsFactors=FALSE)
metadata[,c("Title","Performer","Date","Population")]
A subset of the reads from one of the samples:
library(GenomicAlignments)
ERR188297()
Before we start, we need to write these paired-end reads, here stored in an R/Bioconductor data object, out to BAM files, because the alpine software works with alignments stored as BAM files. This is not a typical step, as you would normally already have BAM files on disk. Here we write out one BAM file for each of the four samples contained in alpineData, so you can ignore the following code chunk if you are working with your own BAM files.
library(rtracklayer)
dir <- tempdir()
for (sample.name in metadata$Title) {
  # the reads are accessed with functions named after the sample name;
  # the following line calls that function and saves the reads to `gap`
  gap <- match.fun(sample.name)()
  file.name <- file.path(dir, paste0(sample.name, ".bam"))
  export(gap, con=file.name)
}
bam.files <- file.path(dir, paste0(metadata$Title, ".bam"))
names(bam.files) <- metadata$Title
stopifnot(all(file.exists(bam.files)))
Now we continue with the typical steps in an alpine workflow. To fit the bias model, we need to identify single-isoform genes. We used the following chunk of code (here not evaluated) to generate a GRangesList of exons per single-isoform gene.
library(ensembldb)
gtf.file <- "Homo_sapiens.GRCh38.84.gtf"
txdb <- EnsDb(gtf.file) # already an EnsDb
txdf <- transcripts(txdb, return.type="DataFrame")
tab <- table(txdf$gene_id)
one.iso.genes <- names(tab)[tab == 1]
# pre-selected genes based on medium to high counts
# calculated using Rsubread::featureCounts
selected.genes <- scan("selected.genes.txt", what="char")
one.iso.txs <- txdf$tx_id[txdf$gene_id %in%
                          intersect(one.iso.genes, selected.genes)]
ebt0 <- exonsBy(txdb, by="tx")
ebt.fit <- ebt0[one.iso.txs]
Here we pick a subset of the single-isoform genes based on their total exonic length. The comments show the recommended parameters for selecting this subset of genes; here we use different parameters so that building the vignette takes little time and memory.
library(GenomicRanges)
library(alpine)
data(preprocessedData)
# filter out short and long genes
min.bp <- 600
max.bp <- 7000
gene.lengths <- sum(width(ebt.fit))
summary(gene.lengths)
ebt.fit <- ebt.fit[gene.lengths > min.bp & gene.lengths < max.bp]
length(ebt.fit)
set.seed(1)
# better to use ~100 genes
ebt.fit <- ebt.fit[sample(length(ebt.fit), 10)]
Robust fitting of these bias parameters works best with roughly 100 medium-to-high count genes, e.g. a mean count across samples between 200 and 10,000. Such genes can be identified with featureCounts from the Rsubread Bioconductor package, for example.
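The exact selection procedure is up to the user. The following unevaluated sketch shows one way this could be done with Rsubread::featureCounts, assuming the same GTF annotation used below for the EnsDb; the thresholds simply encode the rule of thumb above:

library(Rsubread)
fc <- featureCounts(files=bam.files,
                    annot.ext="Homo_sapiens.GRCh38.84.gtf",
                    isGTFAnnotationFile=TRUE,
                    isPairedEnd=TRUE)
# mean count per gene across samples
mean.counts <- rowMeans(fc$counts)
selected.genes <- rownames(fc$counts)[mean.counts > 200 & mean.counts < 10000]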
It is required to specify a minimum and maximum fragment size, which should be lower and upper quantiles of the fragment length distribution. The minsize and maxsize arguments are recommended to be roughly the 2.5% and 97.5% quantiles of the fragment length distribution. These can be quickly estimated using the helper function getFragmentWidths, iterating over a few single-isoform genes with sufficient counts:
w <- getFragmentWidths(bam.files[1], ebt.fit[[1]])
c(summary(w), Number=length(w))
quantile(w, c(.025, .975))
It is also required to specify the read length. Currently alpine only supports unstranded, paired-end RNA-seq with fixed read length. Differences of +/- 1 bp in read length across samples can be ignored.
getReadLength(bam.files)
Here we use a very limited range of fragment lengths for speed, but for a real analysis we would suggest using the minimum and maximum of the quantiles computed above across all samples (the minimum of the lower quantiles and the maximum of the upper quantiles).
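For example, the per-sample quantiles could be combined as in the following unevaluated sketch, which (arbitrarily) uses the first gene of ebt.fit to estimate fragment widths in every sample:

qs <- sapply(bam.files, function(bf) {
  w <- getFragmentWidths(bf, ebt.fit[[1]])
  quantile(w, c(.025, .975))
})
# widest range across samples
c(minsize=floor(min(qs[1,])), maxsize=ceiling(max(qs[2,])))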
library(alpine)
library(BSgenome.Hsapiens.NCBI.GRCh38)
readlength <- 75
minsize <- 125 # better: 80 for this data
maxsize <- 175 # better: 350 for this data
gene.names <- names(ebt.fit)
names(gene.names) <- gene.names
The following function builds a list of DataFrames which store information about the fragment types from each gene in our training set.
system.time({
  fragtypes <- lapply(gene.names, function(gene.name) {
    buildFragtypes(exons=ebt.fit[[gene.name]],
                   genome=Hsapiens,
                   readlength=readlength,
                   minsize=minsize,
                   maxsize=maxsize,
                   gc.str=FALSE)
  })
})
print(object.size(fragtypes), units="auto")
We can examine the information for a single gene:
head(fragtypes[[1]], 3)
The definition of bias models is extremely flexible in alpine. The models argument should be given as a list, where each element is a model. The model itself should be provided as a list with elements formula and offset. Either formula or offset can be set to NULL for a given model. The allowable offsets are fraglen and/or vlmm, which should be provided in a character vector. Offsets are only estimated once for all models, so setting formula=NULL only makes sense if extra offsets are desired which were not already calculated by other models.
Any kind of R formula can be provided to formula, making use of the fragment features:

- gc (fragment GC content, from 0 to 1)
- relpos (fragment midpoint relative position, from 0 to 1)
- GC40.80, GC40.90, GC20.80, GC20.90 (indicator variables for the presence of, e.g., a 40 bp stretch of 80% or higher GC content within the fragment)

These fragment features reference columns of information stored in fragtypes. Interactions between these terms and offsets are also possible, e.g. gc:fraglen.
Note: it is required to provide the formula as a character string, which is converted internally into a formula, due to details of how R formulas make copies of objects from the environment.
models <- list(
  "GC" = list(
    formula = "count ~ ns(gc,knots=gc.knots,Boundary.knots=gc.bk) + ns(relpos,knots=relpos.knots,Boundary.knots=relpos.bk) + gene",
    offset = c("fraglen")
  ),
  "all" = list(
    formula = "count ~ ns(gc,knots=gc.knots,Boundary.knots=gc.bk) + ns(relpos,knots=relpos.knots,Boundary.knots=relpos.bk) + gene",
    offset = c("fraglen","vlmm")
  )
)
Here we fit one bias model, GC, using fragment length, fragment GC content, relative position, and a term for differences in expression across the genes (+ gene). We fit another bias model, all, with all the terms of the first but additionally with read start bias (encoded by a Variable Length Markov Model, or VLMM).
Note: if a formula is provided, it is required to end with + gene, to account for differences in baseline expression levels across genes while fitting the bias parameters.
The knots and boundary knots for the GC content (gc) and relative position (relpos) splines have reasonable default values, but they can be customized using arguments to the fitBiasModels function.
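For example, custom knots could be supplied when fitting, as in the following unevaluated sketch; the argument names gc.knots, gc.bk, relpos.knots and relpos.bk and the values shown are assumptions here, so check ?fitBiasModels for the exact names and defaults:

fitpar.custom <- fitBiasModels(genes=ebt.fit, bam.file=bam.files[1],
                               fragtypes=fragtypes, genome=Hsapiens,
                               models=models, readlength=readlength,
                               minsize=minsize, maxsize=maxsize,
                               gc.knots=seq(from=0.4, to=0.6, length=3),
                               gc.bk=c(0, 1),
                               relpos.knots=seq(from=0.25, to=0.75, length=3),
                               relpos.bk=c(0, 1))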
The returned object, fitpar, stores the information as a list of fitted parameters across samples.
system.time({
  fitpar <- lapply(bam.files, function(bf) {
    fitBiasModels(genes=ebt.fit,
                  bam.file=bf,
                  fragtypes=fragtypes,
                  genome=Hsapiens,
                  models=models,
                  readlength=readlength,
                  minsize=minsize,
                  maxsize=maxsize)
  })
})
# the object saved as 'fitpar.small' is used for examples in the alpine man pages
# fitpar.small <- fitpar
Note that with more basepairs between minsize and maxsize, and with more genes used for estimation, the bias parameters would be more precise. As estimated here, the curves look a bit wobbly; compare to the curves fit in the alpine paper (see citation("alpine")). The estimated spline coefficients have high variance from too few observations (paired-end fragments) across too few genes.
First we set a color palette to distinguish between samples:
library(RColorBrewer)
palette(brewer.pal(8, "Dark2"))
The fragment length distribution:
perf <- as.integer(factor(metadata$Performer))
plotFragLen(fitpar, col=perf)
The fragment GC bias curves:
plotGC(fitpar, model="all", col=perf)
The relative position curves:
plotRelPos(fitpar, model="all", col=perf)
A 0-order version of the VLMM (note that the VLMM that is used in the model includes positions that are 1- and 2-order, so this plot does not represent the final VLMM used in bias estimation or in estimation of abundances).
plotOrder0(fitpar[["ERR188297"]][["vlmm.fivep"]][["order0"]])
plotOrder0(fitpar[["ERR188297"]][["vlmm.threep"]][["order0"]])
A coefficient table for the terms in formula:
print(head(fitpar[["ERR188297"]][["summary"]][["all"]]), row.names=FALSE)
We pick a subset of genes for estimating transcript abundances. If the gene annotation includes genes whose transcripts span multiple chromosomes, or whose transcripts do not overlap at all and are very far apart, splitGenesAcrossChroms and splitLongGenes, respectively, can be used to split these genes. To merge any overlapping transcripts into "genes", the mergeGenes function can be used. Here we use the ENSEMBL gene annotation as is.
The following code chunk is not evaluated but was used to select a few genes for demonstrating estimateAbundance:
one.iso.genes <- intersect(names(tab)[tab == 1], selected.genes)
two.iso.genes <- intersect(names(tab)[tab == 2], selected.genes)
three.iso.genes <- intersect(names(tab)[tab == 3], selected.genes)
set.seed(1)
genes.theta <- c(sample(one.iso.genes, 2),
                 sample(two.iso.genes, 2),
                 sample(three.iso.genes, 2))
txdf.theta <- txdf[txdf$gene_id %in% genes.theta,]
ebt.theta <- ebt0[txdf.theta$tx_id]
Next we specify the set of models we want to use, referring back by name to the models that were fit in the previous step. Additionally, we can include any of the following models: null, fraglen, vlmm, or fraglen.vlmm, which are the four models that can be fit using only offsets (none, either, or both of the offsets).
model.names <- c("null","fraglen.vlmm","GC")
Here we estimate FPKM-scale abundances for multiple genes and multiple samples. If lib.sizes is not specified, a default value of 1e6 is used. estimateAbundance works one gene at a time; the transcripts argument expects a GRangesList of the exons for each transcript of that gene (multiple elements if the gene has multiple isoforms).
system.time({
  res <- lapply(genes.theta, function(gene.name) {
    txs <- txdf.theta$tx_id[txdf.theta$gene_id == gene.name]
    estimateAbundance(transcripts=ebt.theta[txs],
                      bam.files=bam.files,
                      fitpar=fitpar,
                      genome=Hsapiens,
                      model.names=model.names)
  })
})
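If per-sample library sizes are preferred over the 1e6 default, they could be supplied via lib.sizes, as in the following unevaluated sketch; whether library size should be counted in reads or in fragments (records/2 for paired-end data) is an assumption here, so check ?estimateAbundance:

library(Rsamtools)
# count fragments per BAM file (records/2 for paired-end data; assumption)
lib.sizes <- sapply(bam.files, function(bf) countBam(bf)$records / 2)
res.ls <- lapply(genes.theta, function(gene.name) {
  txs <- txdf.theta$tx_id[txdf.theta$gene_id == gene.name]
  estimateAbundance(transcripts=ebt.theta[txs],
                    bam.files=bam.files,
                    fitpar=fitpar,
                    genome=Hsapiens,
                    model.names=model.names,
                    lib.sizes=lib.sizes)
})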
Each element of this list has the abundances (theta) and average bias (lambda) for a single gene, across all samples, all models, and all isoforms of the gene:
res[[1]][["ERR188297"]][["GC"]]
res[[6]][["ERR188297"]][["GC"]]
The extractAlpine function can be used to collate estimates across all genes. extractAlpine will scale the estimates such that the total bias observed over all transcripts is centered at 1. The estimates produced by estimateAbundance presume a default library size of 1e6, but they will be rescaled using the total number of fragments across genes when using extractAlpine (if this library size rescaling is not desired, choose divide.out=FALSE).
mat <- extractAlpine(res, model="GC")
mat
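If the rescaling by observed fragment counts is not desired, the per-million scaling of estimateAbundance can be kept by turning off the division, a minimal variant of the call above:

mat.per.million <- extractAlpine(res, model="GC", divide.out=FALSE)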
If we provide a GRangesList containing the exons for each transcript, the returned object will be a SummarizedExperiment. The GRangesList provided to the transcripts argument does not have to be in the correct order; the transcripts will be extracted by name to match the rows of the FPKM matrix.
se <- extractAlpine(res, model="GC", transcripts=ebt.theta)
se
The matrix of FPKM values can be scaled using the median ratio method of DESeq with the normalizeDESeq function. This is a robust method which removes systematic differences in values across samples, and it is more appropriate than scaling by the total count, which is sensitive to very large abundance estimates for a minority of transcripts.
norm.mat <- normalizeDESeq(mat, cutoff=0.1)
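For intuition, the median-ratio idea can be sketched in a few lines of base R. This is an illustration of the concept only, not alpine's normalizeDESeq implementation:

# size factor per sample: median ratio of that sample's values to a
# per-transcript geometric-mean reference, using transcripts with
# all-positive values
estimateMedianRatioSF <- function(m) {
  log.geo.means <- rowMeans(log(m))
  use <- is.finite(log.geo.means)
  apply(m, 2, function(col) exp(median((log(col) - log.geo.means)[use])))
}
sf <- estimateMedianRatioSF(mat)
mat.mr <- sweep(mat, 2, sf, "/")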
The fragment GC bias which alpine estimates can be used in downstream simulations, for example with the polyester Bioconductor package. All we need to do is run the plotGC function, specifying that instead of a plot, we want a matrix of probabilities for each percentile of fragment GC content. This matrix can be provided to the frag_GC_bias argument of simulate_experiment. We load a fitpar object that was run with the fragment length range [80,350] bp.
data(preprocessedData)
prob.mat <- plotGC(fitpar, "all", return.type=2)
head(prob.mat)
If return.type=0 (the default), the function makes a plot of log fragment rate over fragment GC content. If return.type=1, the function returns the matrix of log fragment rate over percentiles of fragment GC content, and if return.type=2, it returns a matrix of probabilities of observing fragments based on percentiles of fragment GC content (the log fragment rate, exponentiated and scaled to have a maximum of 1). The matrix returned by return.type=2 is appropriate for downstream use with polyester.
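The following unevaluated sketch gives a rough idea of how this matrix might be passed to polyester; "transcripts.fa" is a hypothetical FASTA of transcript sequences, and the exact dimensions expected by frag_GC_bias (here assumed to be one column of probabilities over GC percentiles per simulated sample) should be checked against ?simulate_experiment:

library(polyester)
n.samples <- 4 # total simulated samples across the two groups
# recycle alpine's per-sample probability curves, one column per simulated sample
frag.gc.bias <- prob.mat[, rep(seq_len(ncol(prob.mat)), length.out=n.samples)]
n.tx <- count_transcripts("transcripts.fa")
simulate_experiment("transcripts.fa",
                    num_reps=c(2, 2),
                    reads_per_transcript=300,
                    fold_changes=matrix(1, nrow=n.tx, ncol=2),
                    outdir="simulated_reads",
                    frag_GC_bias=frag.gc.bias)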
In the alpine paper, it was shown that models incorporating fragment GC bias can be a better predictor of test set RNA-seq fragment coverage, compared to models incorporating read start bias. Here we show how to predict fragment coverage for a single-isoform gene using a variety of fitted bias models. As with estimateAbundance, the model names need to refer back to models fit using fitBiasModels.
model.names <- c("fraglen","fraglen.vlmm","GC","all")
The following function computes the predicted coverage for a single-isoform gene. We load a fitpar object that was run with the fragment length range [80,350] bp.
fitpar[[1]][["model.params"]][c("minsize","maxsize")]
system.time({
  pred.cov <- predictCoverage(gene=ebt.fit[["ENST00000245479"]],
                              bam.files=bam.files["ERR188204"],
                              fitpar=fitpar,
                              genome=Hsapiens,
                              model.names=model.names)
})
We can plot the observed and predicted coverage for one of the genes:
palette(brewer.pal(9, "Set1"))
frag.cov <- pred.cov[["ERR188204"]][["frag.cov"]]
plot(frag.cov, type="l", lwd=3, ylim=c(0, max(frag.cov)*1.5))
for (i in seq_along(model.names)) {
  m <- model.names[i]
  pred <- pred.cov[["ERR188204"]][["pred.cov"]][[m]]
  lines(pred, col=i, lwd=3)
}
legend("topright",
       legend=c("observed", model.names),
       col=c("black", seq_along(model.names)),
       lwd=3)
sessionInfo()