knitr::opts_chunk$set(echo = TRUE) knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
Global Nuclear Run On and Sequencing (GRO-seq) was developed to comprehensively map transcriptional activity in cells [@Core2008; @Hah2011]. GRO-seq, which provides a genome wide 'map' of the position and orientation of all transcriptionally active RNA polymerases, has become increasingly widely used in recent years because it has numerous advantages compared to alternative methods of transcriptome profiling, such as expression microarrays and RNA-seq. Among these, GRO-seq provides information on instantaneous transactional responses because it detects primary transcription, as opposed to mature, processed mRNA. In addition, because it is independent of RNA polyadenylation, processing, and stability, GRO-seq provides extensive information on the non-coding transcriptome, including primary miRNAs, lincRNAs, enhancer RNAs, and potentially additional, yet undiscovered classes of transcription occurring in cells [@Hah2011; @Hah2013; @Luo2014]. Thus, GRO-seq data provides a complete and instantaneous picture of transcription, and has extensive applications in deciphering the mechanisms of transcriptional regulation.
We have recently developed several important analytical approaches
which make use of GRO-seq data to address new biological questions.
Our pipleline has been packaged and documented,
resulting in the r Rpackage("groHMM")
package for Bioconductor.
Among the more advanced features,
r Rpackage("groHMM")
predicts the boundaries of transcriptional activity
across the genome de novo using a two-state hidden Markov model (HMM).
Our model essentially divides the genome into
"transcribed" and "non-transcribed" regions
in a strand specific manner [@Hah2011].
We also use HMMs to identify the leading edge of Pol II at genes
activated by a stimulus in GRO-seq time course data.
This approach allows the genome-wide interrogation of transcription rates
in cells [@Danko2013].
In addition to these advanced features,
r Rpackage("groHMM")
provides wrapper functions for
counting raw reads [@PagesGRanges]
and creating metagene (averaging) plots.
r Rpackage("groHMM")
takes over all aspects of analysis
after reads have been aligned to a reference genome
with short-read alignment tools.
Although r Rpackage("groHMM")
is tailored towards GRO-seq data,
the same functions and analytical methodologies can, in principal,
be applied to a wide variety of other short read data sets
since the package includes a number of easily usable and extensible
functions for general short read data analysis.
This guide focuses on the most common application of the package.
The r Rpackage("groHMM")
package is available on GitHub and can be
downloaded as follows:
source("https://bioconductor.org/biocLite.R") install.packages("devtools") devtools::install_github( "coregenomics/groHMM", ref="1.99.x", repos=BiocInstaller::biocinstallRepos())
Additional packages are not required to use r Rpackage("groHMM")
,
but they are used to download annotations and evaluate transcripts,
and should be installed for this tutorial.
devtools::install_github( "coregenomics/groHMM", ref="1.99.x", repos=BiocInstaller::biocinstallRepos(), dependencies=TRUE)
In this tutorial we will use example data from
Hah et al. [2011] (GEO accession GSE27463).
This experiment was designed to assess transcriptional changes
following treatment of MCF-7 cells with 17$\beta$ estradiol (E2).
The data include a time course of GRO-seq data following treatment with E2
(i.e., 0, 10, 40 and 160 min.).
Two biological replicates are available for each time point.
BED files were obtained from GEO and lifted over to hg19
using the UCSC r Rpackage("liftOver")
tool.
In order to make the package size more manageable,
we have included data from chromosome 7 only in 0 and 10 min. conditions.
r Rpackage("groHMM")
supports parallel processing
via BiocParallel.
r Rpackage("groHMM")
uses the GRanges
class
from the r Biocpkg("GenomicRanges")
package
to represent a collection of genomic features,
allowing synergy with other useful packages in Bioconductor.
Most of the functions in the package take at least two arguments:
"reads" and "features".
Reads represent the genomic coordinates of a set of mapped short reads.
Features represent a set of genomic coordinates of interest
such as genes, exons, or transcripts.
The example data included in this package can be loaded into R
using the following commands.
if (!is.na(Sys.getenv("TRAVIS", NA))) BiocParallel::register(BiocParallel::SerialParam())
library(groHMM) library(GenomicAlignments) S0mR1 <- as(readGAlignments(system.file( "extdata", "S0mR1.bam", package="groHMM")), "GRanges") S0mR2 <- as(readGAlignments(system.file( "extdata", "S0mR1.bam", package="groHMM")), "GRanges") # Use R1 as R2 S40mR1 <- as(readGAlignments(system.file( "extdata", "S40mR1.bam", package="groHMM")), "GRanges") S40mR2 <- as(readGAlignments(system.file( "extdata", "S40mR1.bam", package="groHMM")), "GRanges") # Use R1 as R2 # Combine replicates S0m <- c(S0mR1, S0mR2) S40m <- c(S40mR1, S40mR2)
library(png) library(grid) library(gridExtra) grid.arrange( rasterGrob(as.raster(readPNG("GROseqData.png"))), rasterGrob(as.raster(readPNG("GROseqHMM.png"))), ncol=2, widths=unit(c(0.55, 0.45), "npc"))
In r Rpackage("groHMM")
,
transcribed regions are detected de novo
using a two-state hidden Markov model (HMM).
The model takes GRO-seq read counts as input across the genome
and divides the genome into "transcribed" and "non-transcribed" state
as shown in Figure \@ref(fig:hmm).
First, a single read set is generated by combining all samples
for each time point.
This combined approach improves sensitivity for transcripts
with low expression levels.
Combined reads are used to train the model parameters
using the Baum-Welch Expectation Maximization (EM) algorithm.
Each strand is modeled separately dividing the genome
into non-overlapping 50 bp windows classified as either state.
Sall <- sort(c(S0m, S40m)) # hmmResult <- detectTranscripts(Sall, LtProbB=-200, UTS=5, threshold=1) # Load hmmResult from the saved previous run load(system.file("extdata", "Vignette.RData", package="groHMM")) txHMM <- hmmResult$transcripts head(txHMM)
The detectTranscripts
function also uses two hold-out parameters.
These parameters,
specified by the arguments LtProbB
and UTS
,
represents the log-transformed transition probability of
switching from transcribed state to non-transcribed state
and variance of the emission probability for reads in the non-transcribed state,
respectively.
Holdout parameters are used to optimize the performance of HMM predictions
on known genes.
Predicted transcripts are evaluated by comparison to known annotations,
making the assumption that GRO-seq transcripts
should largely be in agreement with available annotations.
Two types of error may occur, as described below.
The HMM parameters are evaluated by the sum of the error rates.
The procedure involves collecting a set of
high-confidence reference transcripts.
An annotation dataset can be constructed by
downloading from the UCSC database or, alternatively,
pre-made TranscriptDb objects can be used
if they are available from Bioconductor.
We will use the UCSC knownGene track
and retrieve transcript annotations with
r Biocpkg("GenomicRanges")
package [@CarsonGFeatures].
library(TxDb.Hsapiens.UCSC.hg19.knownGene) kgdb <- TxDb.Hsapiens.UCSC.hg19.knownGene library(GenomicFeatures) # For refseq annotations: # rgdb <- makeTxDbFromUCSC(genome="hg19", tablename="refGene") # saveDb(hg19RGdb, file="hg19RefGene.sqlite") # rgdb <- loadDb("hg19RefGene.sqlite") kgChr7 <- transcripts( kgdb, columns=c("gene_id", "tx_id", "tx_name"), filter=list(tx_chrom="chr7")) seqlevels(kgChr7) <- seqlevelsInUse(kgChr7)
Because annotations do not provide precise cell type-specific expression information, overlapping transcripts must be merged into a single set, in which each annotation represents the 5$'$- and 3$'$-most boundaries of genes. Different isoforms of each gene are collapsed into one using the ENTREZID. These consensus annotations are used for the evaluation of HMM calling.
# Collapse overlapping annotations kgConsensus <- makeConsensusAnnotations(kgChr7) library(org.Hs.eg.db) map <- select( org.Hs.eg.db, keys=unlist(mcols(kgConsensus)$gene_id), columns=c("SYMBOL"), keytype=c("ENTREZID")) mcols(kgConsensus)$symbol <- map$SYMBOL mcols(kgConsensus)$type <- "gene"
There are two types of error that can be evaluated when comparing predicted transcripts with annotations.
The optimal tuning parameters can be found by minimizing the sum of the two errors. This approach allows identification of the model that best fits the existing annotations and should more precisely predict transcripts in non-annotated parts of the genome.
e <- evaluateHMMInAnnotations(txHMM, kgConsensus) e$eval
Here we demonstrate how the optimal value for each tuning parameter can be obtained by running HMM multiple times over a range of the parameters. Among the nine test cases, both the sum of errors and error rate per called transcript show minimal at case #7, shown below. The variation of errors should be greater if whole chromosomes are used. Also, a larger set of the parameters might be used in practice. This tunning step takes long time, so you may skip it for quick review of the package.
tune <- data.frame( LtProbB=c(rep(-100,3), rep(-200,3), rep(-300,3)), UTS=rep(c(5,10,15), 3)) Fp <- windowAnalysis(Sall, strand="+", windowSize=50) Fm <- windowAnalysis(Sall, strand="-", windowSize=50) # evals <- bplapply(seq_len(9), function(x) { # hmm <- detectTranscripts( # Fp=Fp, Fm=Fm, LtProbB=tune$LtProbB[x],UTS=tune$UTS[x]) # e <- evaluateHMMInAnnotations(hmm$transcripts, kgConsensus) # e$eval # }) tune <- cbind(tune, do.call(rbind, evals)) # evals from the previous run tune which.min(tune$total) which.min(tune$errorRate)
To robustly compare transcripts with known genes, densities representing the frequency of transcripts can be calculated relative to their mapped gene annotations. Conceptually, the plot is divided into three distinct regions as shown in Figure \@ref(fig:density), including upstream of known gene annotations, inside genes, and downstream of the annotated polyadenylation site.
Metrics to evaluate the degree of overlap with gene annotations focus on the region upstream of gene annotations, which provides a measure of specificity, and the region inside of genes, which provides a measure of sensitivity. The region downstream of the polyadenylation site is known to contain residual transcription [@Core2008] and consequently is not used to define quality.
The metrics are defined relative to an "ideal" transcript caller, which takes the form of a step function (i.e., red line in the plot). Our quality metrics represent the following (see the plot below for a graphical representation):
We constrained 5$'$FP and 5$'$TN so that they sum to be TP in order to use the upstream region only if a positive number of transcipts are called in the gene body for the calucation of the quality metrics. Note that these quality metrics are conceptually very similar to true positive, true negative, false positive and false negative, respectively. During the comparison, only expressed annotations are used and genes either too small or too large in size are excluded. And also size of transcripts and annotations are scaled to a smaller unit, i.e., 1K for a visual representation. Here best overlapped annotations or transcripts are used for either "merged annotations" or "dissociated a single annotation" type of error. Final quality metrics are represented as TUA (Transcription Unit Accuracy).
getExpressedAnnotations <- function(features, reads) { fLimit <- limitToXkb(features) count <- countOverlaps(fLimit, reads) features <- features[count!=0,] features[ (quantile(width(features), .05) < width(features)) & (width(features) < quantile(width(features), .95)), ] } conExpressed <- getExpressedAnnotations(features=kgConsensus,reads=Sall)
td <- getTxDensity( txHMM, conExpressed, plot=TRUE) u <- par("usr") lines( c(u[1], 0, 0, 1000, 1000, u[2]), c(0, 0, u[4] - .04, u[4] - .04, 0, 0), col="red") legend("topright", lty=c(1,1), col=c(2,1), c("ideal", "groHMM")) text(c(-500, 500), c(.05, .5), c("FivePrimeFP", "TP")) td
If your target of study is non-mammalian,
you can retrieve the relevant annotations from the Bioconductor
if they are supported.
You can check the availability with the function supportedUCSCtable
in
r Biocpkg("GenomicFeatures")
.
If the organism is not supported,
you can still build consensus annotations by directly downloading
the annotation table for the organism from the UCSC genome browser.
The following command line shows the mysql query in linux
to download protein-coding RefSeq genes for all chromosomes
except chrM for C. Elegans
saving into refgene.bed
file.
You can find more information about using MySQL for the UCSC genome browser at
https://genome.ucsc.edu/goldenPath/help/mysql.html
```{bash Download non-mammalian genes, eval=FALSE} mysql \ --user=genome \ --host=genome-mysql.cse.ucsc.edu \ ce10 \ -e " select chrom, txStart, txEnd, name, exonCount, strand, name2 from refGene where chrom != 'chrM' and cdsStart != cdsEnd " | tail -n +1 > refgene.bed
```r G <- read.table("refgene.bed", header=TRUE, stringsAsFactors=FALSE, sep="\t") ce <- GRanges( G$chrom, IRanges(G$txStart, G$txEnd), strand=G$strand, access=G$name, gene_id=G$name2) ceConsensus <- makeConsensusAnnotations(ce)
As for transcript calling,
the default values for detectTranscripts
were set for mammalian genome.
The C. Elegans genome, is much smaller than human genome
and the genes are more tightly located.
Therefore we recommend higher values for the transition probability
from the transcribed to non-transcribed state.
For example, you can use values >-50 instead of -200 for LtProbB
.
Prediction of transcripts by the HMM is not perfect. Discrepancies with the annotations will occur even after the parameters are optimally tuned. Transcript calls can be "fixed" for known types of error by:
The following method will generate a final set of transcripts for further analysis.
bPlus <- breakTranscriptsOnGenes(txHMM, kgConsensus, strand="+") bMinus <- breakTranscriptsOnGenes(txHMM, kgConsensus, strand="-") txBroken <- c(bPlus, bMinus) txFinal <- combineTranscripts(txBroken, kgConsensus)
tdFinal <- getTxDensity(txFinal, conExpressed)
There are several packages in Bioconductor for differential expression analysis
such as r Biocpkg("DESeq")
, r Biocpkg("baySeq")
, r Biocpkg("DEGSeq")
,
or r Biocpkg("edgeR")
.
r Biocpkg("edgeR")
[@Robinson2010] is used for this demonstration.
Differential expression analysis can be done either using
called transcripts or known annotations
depending on the nature of your research question.
The procedures are quite similar except
reads are counted using the genomic locations of either
called transcript or known annotations.
For longer transcripts or annotated genes,
we use a window of +1 to +13 kb from the transcription start site (TSS),
which was chosen in order to exclude reads from RNA polymerases
engaged at the promoter
and to allow enough time for the elongation
of newly initiated Pol II [@Hah2011].
However, variations can be used,
including the entire length of the called or annotated transcripts.
# For called transcripts library(edgeR) txLimit <- limitToXkb(txFinal) ctS0mR1 <- countOverlaps(txLimit, S0mR1) ctS0mR2 <- countOverlaps(txLimit, S0mR2) ctS40mR1 <- countOverlaps(txLimit, S40mR1) ctS40mR2 <- countOverlaps(txLimit, S40mR2) pcounts <- as.matrix(data.frame(ctS0mR1, ctS0mR2, ctS40mR1, ctS40mR2)) group <- factor(c("S0m", "S0m", "S40m", "S40m")) lib.size <- c(NROW(S0mR1), NROW(S0mR2), NROW(S40mR1), NROW(S40mR2)) d <- DGEList(counts=pcounts, lib.size=lib.size, group=group) d <- estimateCommonDisp(d) et <- exactTest(d) de <- decideTestsDGE(et, p=0.001, adjust="fdr") detags <- seq_len(NROW(d))[as.logical(de)] # Number of transcripts regulated at 40m message("up: ",sum(de==1), " down: ", sum(de==-1))
plotSmear(et, de.tags=detags) # 2 fold up or down abline(h = c(-1,1), col="blue")
# For ucsc knownGenes kgChr7 <- transcripts(kgdb, columns=c("gene_id", "tx_id", "tx_name"), filter=list(tx_chrom="chr7")) map <- select(org.Hs.eg.db, keys=unique(unlist(mcols(kgChr7)$gene_id)), columns=c("SYMBOL"), keytype=c("ENTREZID")) missing <- elementNROWS(mcols(kgChr7)[,"gene_id"]) == 0 kgChr7 <- kgChr7[!missing,] inx <- match(unlist(mcols(kgChr7)$gene_id), map$ENTREZID) mcols(kgChr7)$symbol <- map[inx,"SYMBOL"] kgLimit <- limitToXkb(kgChr7) ctS0mR1 <- countOverlaps(kgLimit, S0mR1) ctS0mR2 <- countOverlaps(kgLimit, S0mR2) ctS40mR1 <- countOverlaps(kgLimit, S40mR1) ctS40mR2 <- countOverlaps(kgLimit, S40mR2) counts <- as.matrix(data.frame(ctS0mR1, ctS0mR2, ctS40mR1, ctS40mR2)) group <- factor(c("S0m", "S0m", "S40m", "S40m")) lib.size <- c(NROW(S0mR1), NROW(S0mR2), NROW(S40mR1), NROW(S40mR2)) d <- DGEList(counts=counts, lib.size=lib.size, group=group) d <- estimateCommonDisp(d) et <- exactTest(d) de <- decideTestsDGE(et, p=0.001, adjust="fdr") detags <- seq_len(NROW(d))[as.logical(de)] symbols <- mcols(kgChr7)$symbol # Number of unique genes regulated at 40m message("up: ", NROW(unique(symbols[de == 1]))) message("down: ", NROW(unique(symbols[de == -1]))) plotSmear(et, de.tags=detags) abline(h = c(-1,1), col="blue")
Metagenes show the distribution of reads near TSS of a set of regulated genes
(or some other alignable genomic features of interest).
It can be thought as a smoothed average of read density weighted
by expression over the set of TSS.
The runMetaGene
function has an option for sampling
.
If TRUE
, 10\% of the
transcription units are sampled with replacement 1,000 times
and the median value at each position in the transcription unit over the samples
is used for final metagene result.
Using subsampling results in an image more robust to outliers,
especially when the size of sample is relatively small.
upGenes <- kgChr7[de == 1, ] expReads <- mean(c(NROW(S0m), NROW(S40m))) # Metagene around TSS mg0m <- runMetaGene( features=upGenes, reads=S0m, size=100, normCounts=expReads / NROW(S0m), sampling=FALSE) mg40m <- runMetaGene( features=upGenes, reads=S40m, size=100, normCounts=expReads / NROW(S40m), sampling=FALSE) plotMetaGene <- function(POS=c(-10000:+9999), mg, MIN, MAX) { plot( POS, mg$sense, col="red", type="h", xlim=c(-5000, 5000), ylim=c(floor(MIN),ceiling(MAX)), ylab="Read Density", xlab="Position (relative to TSS)") points(POS, (-1 * rev(mg$antisense)), col="blue", type="h") abline(mean(mg$sense[5000:8000]), 0, lty="dotted") } MAX <- max(c(mg0m$sense, mg40m$sense)) MIN <- -1 * max(c(mg0m$antisense, mg40m$antisense))
plotMetaGene(mg=mg0m, MIN=MIN, MAX=MAX) plotMetaGene(mg=mg40m, MIN=MIN, MAX=MAX)
png(filename="metagene0m.png") plotMetaGene(mg=mg0m, MIN=MIN, MAX=MAX) dev.off() png(filename="metagene40m.png") plotMetaGene(mg=mg40m, MIN=MIN, MAX=MAX) dev.off() grid.arrange( rasterGrob(as.raster(readPNG("metagene0m.png"))), rasterGrob(as.raster(readPNG("metagene40m.png"))), ncol=2, widths=unit(c(0.45, 0.45), "npc"))
save(hmmResult, evals, file="Vignette.RData")
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.