In omsai/groHMM: GRO-seq Analysis Pipeline

knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")

Introduction

Global Nuclear Run On and Sequencing (GRO-seq) was developed to comprehensively map transcriptional activity in cells [@Core2008; @Hah2011]. GRO-seq, which provides a genome wide 'map' of the position and orientation of all transcriptionally active RNA polymerases, has become increasingly widely used in recent years because it has numerous advantages compared to alternative methods of transcriptome profiling, such as expression microarrays and RNA-seq. Among these, GRO-seq provides information on instantaneous transactional responses because it detects primary transcription, as opposed to mature, processed mRNA. In addition, because it is independent of RNA polyadenylation, processing, and stability, GRO-seq provides extensive information on the non-coding transcriptome, including primary miRNAs, lincRNAs, enhancer RNAs, and potentially additional, yet undiscovered classes of transcription occurring in cells [@Hah2011; @Hah2013; @Luo2014]. Thus, GRO-seq data provides a complete and instantaneous picture of transcription, and has extensive applications in deciphering the mechanisms of transcriptional regulation.

We have recently developed several important analytical approaches which make use of GRO-seq data to address new biological questions. Our pipleline has been packaged and documented, resulting in the r Rpackage("groHMM") package for Bioconductor. Among the more advanced features, r Rpackage("groHMM") predicts the boundaries of transcriptional activity across the genome de novo using a two-state hidden Markov model (HMM). Our model essentially divides the genome into "transcribed" and "non-transcribed" regions in a strand specific manner [@Hah2011]. We also use HMMs to identify the leading edge of Pol II at genes activated by a stimulus in GRO-seq time course data. This approach allows the genome-wide interrogation of transcription rates in cells [@Danko2013].

In addition to these advanced features, r Rpackage("groHMM") provides wrapper functions for counting raw reads [@PagesGRanges] and creating metagene (averaging) plots. r Rpackage("groHMM") takes over all aspects of analysis after reads have been aligned to a reference genome with short-read alignment tools. Although r Rpackage("groHMM") is tailored towards GRO-seq data, the same functions and analytical methodologies can, in principal, be applied to a wide variety of other short read data sets since the package includes a number of easily usable and extensible functions for general short read data analysis. This guide focuses on the most common application of the package.

Preparation

The r Rpackage("groHMM") package is available on GitHub and can be downloaded as follows:

source("https://bioconductor.org/biocLite.R")
install.packages("devtools")
devtools::install_github(
    "coregenomics/groHMM",
    ref="1.99.x",
    repos=BiocInstaller::biocinstallRepos())

Additional packages are not required to use r Rpackage("groHMM"), but they are used to download annotations and evaluate transcripts, and should be installed for this tutorial.

devtools::install_github(
    "coregenomics/groHMM",
    ref="1.99.x",
    repos=BiocInstaller::biocinstallRepos(),
    dependencies=TRUE)

Workflow

Read GRO-seq Data Files

In this tutorial we will use example data from Hah et al. [2011] (GEO accession GSE27463). This experiment was designed to assess transcriptional changes following treatment of MCF-7 cells with 17$\beta$ estradiol (E2). The data include a time course of GRO-seq data following treatment with E2 (i.e., 0, 10, 40 and 160 min.). Two biological replicates are available for each time point. BED files were obtained from GEO and lifted over to hg19 using the UCSC r Rpackage("liftOver") tool. In order to make the package size more manageable, we have included data from chromosome 7 only in 0 and 10 min. conditions. r Rpackage("groHMM") supports parallel processing via BiocParallel.

r Rpackage("groHMM") uses the GRanges class from the r Biocpkg("GenomicRanges") package to represent a collection of genomic features, allowing synergy with other useful packages in Bioconductor. Most of the functions in the package take at least two arguments: "reads" and "features". Reads represent the genomic coordinates of a set of mapped short reads. Features represent a set of genomic coordinates of interest such as genes, exons, or transcripts.

The example data included in this package can be loaded into R using the following commands.

if (!is.na(Sys.getenv("TRAVIS", NA)))
    BiocParallel::register(BiocParallel::SerialParam())

library(groHMM)
library(GenomicAlignments)

S0mR1 <- as(readGAlignments(system.file(
    "extdata", "S0mR1.bam",
    package="groHMM")), "GRanges")
S0mR2 <- as(readGAlignments(system.file(
    "extdata", "S0mR1.bam",
    package="groHMM")), "GRanges") # Use R1 as R2
S40mR1 <- as(readGAlignments(system.file(
    "extdata", "S40mR1.bam",
    package="groHMM")), "GRanges")
S40mR2 <- as(readGAlignments(system.file(
    "extdata", "S40mR1.bam",
    package="groHMM")), "GRanges") # Use R1 as R2
# Combine replicates
S0m <- c(S0mR1, S0mR2)
S40m <- c(S40mR1, S40mR2)

Transcript Calling

library(png)
library(grid)
library(gridExtra)
grid.arrange(
    rasterGrob(as.raster(readPNG("GROseqData.png"))),
    rasterGrob(as.raster(readPNG("GROseqHMM.png"))),
    ncol=2,
    widths=unit(c(0.55, 0.45), "npc"))

In r Rpackage("groHMM"), transcribed regions are detected de novo using a two-state hidden Markov model (HMM). The model takes GRO-seq read counts as input across the genome and divides the genome into "transcribed" and "non-transcribed" state as shown in Figure \@ref(fig:hmm). First, a single read set is generated by combining all samples for each time point. This combined approach improves sensitivity for transcripts with low expression levels. Combined reads are used to train the model parameters using the Baum-Welch Expectation Maximization (EM) algorithm. Each strand is modeled separately dividing the genome into non-overlapping 50 bp windows classified as either state.

Sall <- sort(c(S0m, S40m))
# hmmResult <- detectTranscripts(Sall, LtProbB=-200, UTS=5, threshold=1)

# Load hmmResult from the saved previous run 
load(system.file("extdata", "Vignette.RData", package="groHMM"))
txHMM <- hmmResult$transcripts
head(txHMM)

The detectTranscripts function also uses two hold-out parameters. These parameters, specified by the arguments LtProbB and UTS, represents the log-transformed transition probability of switching from transcribed state to non-transcribed state and variance of the emission probability for reads in the non-transcribed state, respectively. Holdout parameters are used to optimize the performance of HMM predictions on known genes.

Evaluation of Transcript Calling

Predicted transcripts are evaluated by comparison to known annotations, making the assumption that GRO-seq transcripts should largely be in agreement with available annotations. Two types of error may occur, as described below. The HMM parameters are evaluated by the sum of the error rates. The procedure involves collecting a set of high-confidence reference transcripts. An annotation dataset can be constructed by downloading from the UCSC database or, alternatively, pre-made TranscriptDb objects can be used if they are available from Bioconductor. We will use the UCSC knownGene track and retrieve transcript annotations with r Biocpkg("GenomicRanges") package [@CarsonGFeatures].

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
kgdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

library(GenomicFeatures)
# For refseq annotations:
# rgdb <- makeTxDbFromUCSC(genome="hg19", tablename="refGene")
# saveDb(hg19RGdb, file="hg19RefGene.sqlite")
# rgdb <- loadDb("hg19RefGene.sqlite")
kgChr7 <- transcripts(
    kgdb, columns=c("gene_id", "tx_id", "tx_name"),
    filter=list(tx_chrom="chr7"))
seqlevels(kgChr7) <- seqlevelsInUse(kgChr7)

Because annotations do not provide precise cell type-specific expression information, overlapping transcripts must be merged into a single set, in which each annotation represents the 5$'$- and 3$'$-most boundaries of genes. Different isoforms of each gene are collapsed into one using the ENTREZID. These consensus annotations are used for the evaluation of HMM calling.

# Collapse overlapping annotations
kgConsensus <- makeConsensusAnnotations(kgChr7)

library(org.Hs.eg.db)
map <- select(
    org.Hs.eg.db,
    keys=unlist(mcols(kgConsensus)$gene_id),
    columns=c("SYMBOL"),
    keytype=c("ENTREZID"))
mcols(kgConsensus)$symbol <- map$SYMBOL
mcols(kgConsensus)$type <- "gene"

There are two types of error that can be evaluated when comparing predicted transcripts with annotations.

The number of transcripts overlapping two or more annotations, (i.e., these transcripts "merged annotations together") and
The number of annotations that overlap two or more transcripts on the same strand (i.e., we say that these transcript calls "dissociated a single annotation") must be determined.

The optimal tuning parameters can be found by minimizing the sum of the two errors. This approach allows identification of the model that best fits the existing annotations and should more precisely predict transcripts in non-annotated parts of the genome.

e <- evaluateHMMInAnnotations(txHMM, kgConsensus)
e$eval

HMM Tuning

Here we demonstrate how the optimal value for each tuning parameter can be obtained by running HMM multiple times over a range of the parameters. Among the nine test cases, both the sum of errors and error rate per called transcript show minimal at case #7, shown below. The variation of errors should be greater if whole chromosomes are used. Also, a larger set of the parameters might be used in practice. This tunning step takes long time, so you may skip it for quick review of the package.

tune <- data.frame(
    LtProbB=c(rep(-100,3), rep(-200,3), rep(-300,3)),
    UTS=rep(c(5,10,15), 3))

Fp <- windowAnalysis(Sall, strand="+", windowSize=50)
Fm <- windowAnalysis(Sall, strand="-", windowSize=50)

# evals <- bplapply(seq_len(9), function(x) {
#     hmm <- detectTranscripts(
#         Fp=Fp, Fm=Fm, LtProbB=tune$LtProbB[x],UTS=tune$UTS[x])
#     e <- evaluateHMMInAnnotations(hmm$transcripts, kgConsensus)
#     e$eval
# })
tune <- cbind(tune, do.call(rbind, evals))  # evals from the previous run
tune
which.min(tune$total)
which.min(tune$errorRate)

To robustly compare transcripts with known genes, densities representing the frequency of transcripts can be calculated relative to their mapped gene annotations. Conceptually, the plot is divided into three distinct regions as shown in Figure \@ref(fig:density), including upstream of known gene annotations, inside genes, and downstream of the annotated polyadenylation site.

Metrics to evaluate the degree of overlap with gene annotations focus on the region upstream of gene annotations, which provides a measure of specificity, and the region inside of genes, which provides a measure of sensitivity. The region downstream of the polyadenylation site is known to contain residual transcription [@Core2008] and consequently is not used to define quality.

The metrics are defined relative to an "ideal" transcript caller, which takes the form of a step function (i.e., red line in the plot). Our quality metrics represent the following (see the plot below for a graphical representation):

true positive; gene body (TP) = (gene annotation area under the curve) / (max area for matched transcripts),
false negative; gene body (FN) = 1 - TP,
5$'$ false positive; upstream region (5$'$FP) = (5$'$ overhang area under the curve) /(max area for matched transcripts),
5$'$ true negative; upstream region (5$'$TN) = TP - 5$'$FP.

We constrained 5$'$FP and 5$'$TN so that they sum to be TP in order to use the upstream region only if a positive number of transcipts are called in the gene body for the calucation of the quality metrics. Note that these quality metrics are conceptually very similar to true positive, true negative, false positive and false negative, respectively. During the comparison, only expressed annotations are used and genes either too small or too large in size are excluded. And also size of transcripts and annotations are scaled to a smaller unit, i.e., 1K for a visual representation. Here best overlapped annotations or transcripts are used for either "merged annotations" or "dissociated a single annotation" type of error. Final quality metrics are represented as TUA (Transcription Unit Accuracy).

getExpressedAnnotations <- function(features, reads) {
    fLimit <- limitToXkb(features)
    count <- countOverlaps(fLimit, reads)
    features <- features[count!=0,]
    features[
        (quantile(width(features), .05) < width(features))
        & (width(features) < quantile(width(features), .95)),
    ]
}
conExpressed <- getExpressedAnnotations(features=kgConsensus,reads=Sall)

td <- getTxDensity(
    txHMM, conExpressed, plot=TRUE)
u <- par("usr")
lines(
    c(u[1], 0, 0, 1000, 1000, u[2]),
    c(0, 0, u[4] - .04, u[4] - .04, 0, 0),
    col="red")
legend("topright", lty=c(1,1), col=c(2,1), c("ideal", "groHMM"))
text(c(-500, 500), c(.05, .5), c("FivePrimeFP", "TP"))
td

Working with non-mammalian Genomes

If your target of study is non-mammalian, you can retrieve the relevant annotations from the Bioconductor if they are supported. You can check the availability with the function supportedUCSCtable in r Biocpkg("GenomicFeatures"). If the organism is not supported, you can still build consensus annotations by directly downloading the annotation table for the organism from the UCSC genome browser. The following command line shows the mysql query in linux to download protein-coding RefSeq genes for all chromosomes except chrM for C. Elegans saving into refgene.bed file. You can find more information about using MySQL for the UCSC genome browser at https://genome.ucsc.edu/goldenPath/help/mysql.html

```{bash Download non-mammalian genes, eval=FALSE} mysql \ --user=genome \ --host=genome-mysql.cse.ucsc.edu \ ce10 \ -e " select chrom, txStart, txEnd, name, exonCount, strand, name2 from refGene where chrom != 'chrM' and cdsStart != cdsEnd " | tail -n +1 > refgene.bed

```r
G <- read.table("refgene.bed", header=TRUE, stringsAsFactors=FALSE, sep="\t")
ce <- GRanges(
    G$chrom, IRanges(G$txStart, G$txEnd), strand=G$strand, access=G$name,
    gene_id=G$name2)
ceConsensus <- makeConsensusAnnotations(ce)

As for transcript calling, the default values for detectTranscripts were set for mammalian genome. The C. Elegans genome, is much smaller than human genome and the genes are more tightly located. Therefore we recommend higher values for the transition probability from the transcribed to non-transcribed state. For example, you can use values >-50 instead of -200 for LtProbB.

Repairing Transcript Calling with Annotations

Prediction of transcripts by the HMM is not perfect. Discrepancies with the annotations will occur even after the parameters are optimally tuned. Transcript calls can be "fixed" for known types of error by:

breaking transcripts that have merged annotations and
combining transcripts that have dissociated a single annotation.

The following method will generate a final set of transcripts for further analysis.

bPlus <- breakTranscriptsOnGenes(txHMM, kgConsensus, strand="+")
bMinus <- breakTranscriptsOnGenes(txHMM, kgConsensus, strand="-")
txBroken <- c(bPlus, bMinus)
txFinal <- combineTranscripts(txBroken, kgConsensus)

tdFinal <- getTxDensity(txFinal, conExpressed)

Differential Analysis with edgeR

There are several packages in Bioconductor for differential expression analysis such as r Biocpkg("DESeq"), r Biocpkg("baySeq"), r Biocpkg("DEGSeq"), or r Biocpkg("edgeR"). r Biocpkg("edgeR") [@Robinson2010] is used for this demonstration. Differential expression analysis can be done either using called transcripts or known annotations depending on the nature of your research question. The procedures are quite similar except reads are counted using the genomic locations of either called transcript or known annotations. For longer transcripts or annotated genes, we use a window of +1 to +13 kb from the transcription start site (TSS), which was chosen in order to exclude reads from RNA polymerases engaged at the promoter and to allow enough time for the elongation of newly initiated Pol II [@Hah2011]. However, variations can be used, including the entire length of the called or annotated transcripts.

# For called transcripts
library(edgeR)
txLimit <- limitToXkb(txFinal)
ctS0mR1 <- countOverlaps(txLimit, S0mR1)
ctS0mR2 <- countOverlaps(txLimit, S0mR2)
ctS40mR1 <- countOverlaps(txLimit, S40mR1)
ctS40mR2 <- countOverlaps(txLimit, S40mR2)

pcounts <- as.matrix(data.frame(ctS0mR1, ctS0mR2, ctS40mR1, ctS40mR2))
group <- factor(c("S0m", "S0m", "S40m", "S40m"))
lib.size <- c(NROW(S0mR1), NROW(S0mR2), NROW(S40mR1), NROW(S40mR2))
d <- DGEList(counts=pcounts, lib.size=lib.size, group=group)
d <- estimateCommonDisp(d)
et <- exactTest(d)
de <- decideTestsDGE(et, p=0.001, adjust="fdr")
detags <- seq_len(NROW(d))[as.logical(de)]
# Number of transcripts regulated at 40m
message("up: ",sum(de==1), " down: ", sum(de==-1))

plotSmear(et, de.tags=detags)
# 2 fold up or down
abline(h = c(-1,1), col="blue")

# For ucsc knownGenes
kgChr7 <- transcripts(kgdb, columns=c("gene_id", "tx_id", "tx_name"),
                            filter=list(tx_chrom="chr7"))
map <- select(org.Hs.eg.db,
        keys=unique(unlist(mcols(kgChr7)$gene_id)),
        columns=c("SYMBOL"), keytype=c("ENTREZID"))
missing <- elementNROWS(mcols(kgChr7)[,"gene_id"]) == 0
kgChr7 <- kgChr7[!missing,]

inx <- match(unlist(mcols(kgChr7)$gene_id), map$ENTREZID)
mcols(kgChr7)$symbol <- map[inx,"SYMBOL"]

kgLimit <- limitToXkb(kgChr7)
ctS0mR1 <- countOverlaps(kgLimit, S0mR1)
ctS0mR2 <- countOverlaps(kgLimit, S0mR2)
ctS40mR1 <- countOverlaps(kgLimit, S40mR1)
ctS40mR2 <- countOverlaps(kgLimit, S40mR2)

counts <- as.matrix(data.frame(ctS0mR1, ctS0mR2, ctS40mR1, ctS40mR2))
group <- factor(c("S0m", "S0m", "S40m", "S40m"))
lib.size <- c(NROW(S0mR1), NROW(S0mR2), NROW(S40mR1), NROW(S40mR2))
d <- DGEList(counts=counts, lib.size=lib.size, group=group)

d <- estimateCommonDisp(d)
et <- exactTest(d)
de <- decideTestsDGE(et, p=0.001, adjust="fdr")
detags <- seq_len(NROW(d))[as.logical(de)]
symbols <- mcols(kgChr7)$symbol
# Number of unique genes regulated at 40m
message("up: ", NROW(unique(symbols[de == 1])))
message("down: ", NROW(unique(symbols[de == -1])))
plotSmear(et, de.tags=detags)
abline(h = c(-1,1), col="blue")

Metagene Analysis

Metagenes show the distribution of reads near TSS of a set of regulated genes (or some other alignable genomic features of interest). It can be thought as a smoothed average of read density weighted by expression over the set of TSS. The runMetaGene function has an option for sampling. If TRUE, 10\% of the transcription units are sampled with replacement 1,000 times and the median value at each position in the transcription unit over the samples is used for final metagene result. Using subsampling results in an image more robust to outliers, especially when the size of sample is relatively small.

upGenes <- kgChr7[de == 1, ]
expReads <- mean(c(NROW(S0m), NROW(S40m)))
# Metagene around TSS
mg0m <- runMetaGene(
    features=upGenes, reads=S0m, size=100, normCounts=expReads / NROW(S0m),
    sampling=FALSE)
mg40m <- runMetaGene(
    features=upGenes, reads=S40m, size=100, normCounts=expReads / NROW(S40m),
    sampling=FALSE)

plotMetaGene <- function(POS=c(-10000:+9999), mg, MIN, MAX) {
    plot(
        POS, mg$sense, col="red", type="h", xlim=c(-5000, 5000),
        ylim=c(floor(MIN),ceiling(MAX)), ylab="Read Density",
        xlab="Position (relative to TSS)")
     points(POS, (-1 * rev(mg$antisense)), col="blue", type="h")
     abline(mean(mg$sense[5000:8000]), 0, lty="dotted")
}

MAX <- max(c(mg0m$sense, mg40m$sense))
MIN <- -1 * max(c(mg0m$antisense, mg40m$antisense))

plotMetaGene(mg=mg0m, MIN=MIN, MAX=MAX)
plotMetaGene(mg=mg40m, MIN=MIN, MAX=MAX)

png(filename="metagene0m.png")
plotMetaGene(mg=mg0m, MIN=MIN, MAX=MAX)
dev.off()
png(filename="metagene40m.png")
plotMetaGene(mg=mg40m, MIN=MIN, MAX=MAX)
dev.off()
grid.arrange(
    rasterGrob(as.raster(readPNG("metagene0m.png"))),
    rasterGrob(as.raster(readPNG("metagene40m.png"))),
    ncol=2,
    widths=unit(c(0.45, 0.45), "npc"))

save(hmmResult, evals, file="Vignette.RData")

Session Info

sessionInfo()

References

omsai/groHMM documentation built on May 24, 2019, 2:18 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

omsai/groHMM
GRO-seq Analysis Pipeline

In omsai/groHMM: GRO-seq Analysis Pipeline

Introduction

Preparation

Workflow

Read GRO-seq Data Files

Transcript Calling

Evaluation of Transcript Calling

HMM Tuning

Working with non-mammalian Genomes

Repairing Transcript Calling with Annotations

Differential Analysis with edgeR

Metagene Analysis

Session Info

References

R Package Documentation

Browse R Packages

We want your feedback!

omsai/groHMM GRO-seq Analysis Pipeline

In omsai/groHMM: GRO-seq Analysis Pipeline

Introduction

Preparation

Workflow

Read GRO-seq Data Files

Transcript Calling

Evaluation of Transcript Calling

HMM Tuning

Working with non-mammalian Genomes

Repairing Transcript Calling with Annotations

Differential Analysis with edgeR

Metagene Analysis

Session Info

References

R Package Documentation

Browse R Packages

We want your feedback!

omsai/groHMM
GRO-seq Analysis Pipeline