The purpose of this document is to generate, from the data included here, many of the figures, tables, and supplementary documents present in the StateHub / StatePaintR paper, and provide an introduction to using the StatePaintR package.


What is StateHub / StatePaintR?

Genome annotation is critical to understand the function of disease variants, especially for clinical applications. To meet this need there are segmentations available from public consortia reflecting varying unsupervised approaches to functional annotation based on epigenetics data, but there remains a need for transparent, reproducible, and easily interpreted genomic maps of the functional biology of chromatin. We introduce here methods for defining chromatin state with a combinatorial epigenomic model using an annotation tool, StatePaintR and a website database, StateHub. Annotations are fully documented with change history and versioning, authorship information, and original source files. The tool calculates quantitative state scores based on genome-wide ranking, allowing prioritization and enrichment testing, facilitating quantitative analysis. StateHub hosts annotation tracks for major public consortia as a resource, and allows users to submit their own alternative models.

A preprint is availible on bioRxiv A more comprehensive overview of of the analysis in the paper can be found in the supplemental Rmarkdown file on StateHub

Setup of StatePaintR environment

(Optional) Install the development version from github

source("https://bioconductor.org/biocLite.R")
biocLite("devtools")
## Installing from https://github.com/Simon-Coetzee/StatePaintR/
biocLite("Simon-Coetzee/StatePaintR") 

We begin by loading the required packages, not all of these are required to simply run StatePaintR, but they will be necissary for doing some of the analysis that follows.

library(GenomicRanges)
library(RColorBrewer)
library(ggplot2)
library(httr)
library(readr)
library(StatePaintR)

Additionally we download all of the data required to run this vignette:

download.file("https://s3-us-west-2.amazonaws.com/statehub-trackhub/statepaintr_vignette_data.tar.gz", "statepaintr_data.tar.gz")
untar("statepaintr_data.tar.gz", compressed = "gzip")
data.frame(FILES = list.files("statepaintr_data", all.files = FALSE, recursive = TRUE))

With This complete, we begin analysis.

Brief overview of running StatePaintR

The process of running StatePaintR falls into three basic steps.

Download the decision matrix

We first download the decision matrix by indicating the model's unique ID as indicated on the StateHub website.

decisionmatrix <- get.decision.matrix(search = "5813b67f46e0fb06b493ceb0")
decisionmatrix

Generate a Manifest

We need a manifest indicating the data that we want to segment

manifest <- "statepaintr_data/manifest.hmec.txt"
read.table(manifest,
           sep = "\t",
           stringsAsFactors = FALSE,
           header = TRUE)

Segment the Genome

Using the manifest, and the decision matrix we aquired above, we can segment the genome

hmec.states <- PaintStates(manifest = "statepaintr_data/manifest.hmec.txt",
                           decisionMatrix = decisionmatrix,
                           progress = FALSE)
hmec.states
attributes(hmec.states)$manifest
attributes(hmec.states)$decisionMatrix

Export the segments

Once we have generated the segmentations, we can write them to disk as a bed file for viewing in a genome browser, or for additional analysis outside of R

ExportStatePaintR(states = hmec.states,
                  output.dir = "statepaintr_data/HMEC/")

Performance of enhancer predictions

VISTA validated enhancers

In order to determine a basis for true positive enhancers we use enhancers validated by the VISTA enhancer browser. We can begin by reading the VISTA data from a bed file. The original data comes from the ENCODE annotation file set ENCSR964TTF

vista.enhancers <- read_tsv("statepaintr_data/vista.validated.enhancers.txt", col_names = TRUE)
vista.enhancers <- GRanges(vista.enhancers)
vista.enhancers

As you can see, it contains the ranges of the tested enhancers, if they were validated in vivo, and what tissues they were validated in. This file has been slightly modified to include enhancers that may be found in the "embryonic facial prominence", by combining "ear", "eye", "branchial arch", "nose", and "facial mesenchyme".

ENCODE ChIP-Seq and DNase-seq data

Using the data made availible by the ENCODE project we can retrieve data on multiple Histone ChIP-Seq and sometimes DNase-seq for relevent tissues in order to make enhancer predictions.

| Dataset | Accession Number | |----------------------------------------|--------------------------------------------------------------------------------| | embryonic mouse neural tube (11.5 day) | ENCSR215ZYV | | embryonic mouse midbrain (11.5 day) | ENCSR843IAS | | embryonic mouse hindbrain (11.5 day) | ENCSR501OPC | | embryonic mouse limb (11.5 day) | ENCSR283NCE | | embryonic mouse heart (11.5 day) | ENCSR016LTR |

Narrowpeak calls for this data and our manifest are included with this document. In order to use DNase-seq when availible, we used the IDR process to merge replicates.

manifest <- "statepaintr_data/mouse.idr/IDR.Manifest.txt"
head(read.table(manifest,
                sep = "\t",
                stringsAsFactors = FALSE,
                header = TRUE))

Aquiring a StateHub model

We can begin our enhancer predictions by segmenting the genome for these samples. We have our manifest, now all we need is a decisionMatrix from StateHub to define the rules for segmenting the genome.

dm <- get.decision.matrix("5813b67f46e0fb06b493ceb0")
dm

Our decision matrix is downloaded, but in order to score our enhancers we make some modifications. We want to exclude the "Regulatory" mark from our scoring process, and we want keep DNase-seq peaks intact as the core of our features.

dm <- doNotScore(dm, "Regulatory")
dm <- doNotSplit(dm, "Core")
dm

Running StatePaintR to get enhancer predictions and scores

With preparation complete, we can now segment the genome for our samples defined in the manifest.

mouse.embryo.states <- PaintStates(manifest = manifest, 
                                   decisionMatrix = dm, 
                                   scoreStates = TRUE,
                                   progress = FALSE)
mouse.embryo.states$heart
mymart <- biomaRt::useMart("ENSEMBL_MART_ENSEMBL", dataset = "mmusculus_gene_ensembl")
mylocation <- GRanges("chr1:42623508-42754885")
library(Gviz)
mygenes <- BiomartGeneRegionTrack(start = start(mylocation), end = end(mylocation),
                                  chr = seqlevels(mylocation),
                                  biomart = mymart,
                                  col.line = NULL, col = NULL,
                                  stackHeight = 0.5,
                                  rotation.title = 360, col.title = "black", cex.title = 0.5,
                                  filter = list(biotype = c("protein_coding", "lincRNA")),
                                  transcriptAnnotation = "symbol",
                                  name = "ENSEMBL genes", stacking = "squish")
genome(mygenes) <- "mm10"
vista.in.range <- subsetByOverlaps(vista.enhancers, mylocation)
vista.anno <- AnnotationTrack(range = vista.in.range,
                              fill = mcols(vista.in.range)$validated + 2,
                              stacking = "dense",
                              col = NULL,
                              col.line = NULL,
                              stackHeight = 1,
                              shape = "box",
                              rotation.title = 360,
                              background.title = "transparent", col.title = "black",
                              cex.title = 0.5,
                              name = "VISTA enhancers")

PlotStates(states = mouse.embryo.states, 
           location = mylocation, 
           gene.track = mygenes, 
           additional.tracks = vista.anno)

Comparing Predictions to VISTA {#train_subset}

All of our model tuning was done on a selection of 100 valid enhancers for each tissue type, so we'll exclude those 100 from our subsequent comparisons.

seed <- 42
train.enhancers <- list()
test.enhancers <- list()
set.seed(seed)
for (tissue in names(mouse.embryo.states)) {
  vista.enhancers.train <- vista.enhancers[, tissue]
  vista.enhancers.train <- sample(which(mcols(vista.enhancers.train)[, tissue] == 1L),
                                  size = 100)
  train.enhancers <- c(train.enhancers, list(vista.enhancers.train))
  names(train.enhancers)[length(train.enhancers)] <- tissue
  vista.enhancers.test <- c(1:length(vista.enhancers))[-vista.enhancers.train]
  test.enhancers <- c(test.enhancers, list(vista.enhancers.test))
  names(test.enhancers)[length(test.enhancers)] <- tissue
}

Evaluation of our models, and the external enhancer predictions against which we compare, is done with Precision-Recall-Gain Curves [^1].

[^1]: From the abstract of the paper: "Precision-Recall analysis abounds in applications of binary classification where true negatives do not add value and hence should not affect assessment of the classifier's performance. Perhaps inspired by the many advantages of receiver operating characteristic (ROC) curves and the area under such curves for accuracy-based performance assessment, many researchers have taken to report Precision-Recall (PR) curves and associated areas as performance metric. We demonstrate in this paper that this practice is fraught with difficulties, mainly because of incoherent scale assumptions -- e.g., the area under a PR curve takes the arithmetic mean of precision values whereas the Fβ score applies the harmonic mean. We show how to fix this by plotting PR curves in a different coordinate system, and demonstrate that the new Precision-Recall-Gain curves inherit all key advantages of ROC curves. In particular, the area under Precision-Recall-Gain curves conveys an expected F1 score on a harmonic scale, and the convex hull of a Precision-Recall-Gain curve allows us to calibrate the classifier's scores so as to determine, for each operating point on the convex hull, the interval of β values for which the point optimises Fβ. We demonstrate experimentally that the area under traditional PR curves can easily favour models with lower expected F1 score than others, and so the use of Precision-Recall-Gain curves will result in better model selection."

StatePaintR Predictions

Here we generate the StatePaintR predictions, using the states EAR, EARC, AR, and ARC as our predicted enhancers. This also prepares the data for plotting in ggplot2 by extracting the precision gain and recall gain, and the convex hull. Additionally we look at the area under the precision recall gain curve to get an idea of the accuracy.

plot.data.sp <- PRG(states = mouse.embryo.states,
                    comparison = vista.enhancers,
                    state.select = c("EARC", "ARC", "AR", "EAR"),
                    comparison.select = test.enhancers)
plot.data.sp$auprg

Load Enhancer Predictions

ENCODE v3 Enhancer-like regions

We can so something similar for evaluating ENCODE candiate enhancer calls We used the following data from the ENCODE data portal:

| Enhancer-like regions | Accession Number | |----------------------------------------------------|-----------------------------------------------------------| | using DNase and H3K27ac for neural tube (11.5 day) | ENCFF786KUB | | using DNase and H3K27ac for midbrain (11.5 day) | ENCFF733UJT | | using DNase and H3K27ac for hindbrain (11.5 day) | ENCFF324INM | | using DNase and H3K27ac for limb (11.5 day) | ENCFF520EGD | | using H3K27ac for heart (11.5 day) | ENCSR312DDF |

So we begin by downloading this data, and coverting it into GRanges objects

encode.enhancers <- c("neural tube" = "ENCFF786KUB", 
                      midbrain      = "ENCFF733UJT", 
                      hindbrain     = "ENCFF324INM",
                      limb          = "ENCFF520EGD",
                      heart         = "ENCFF435VGC")
encode.enhancers <- sapply(encode.enhancers, function(x) {
  encode <- "https://www.encodeproject.org"
  x <- content(GET(encode, path = x))$href
  return(paste0(encode, x))
})

encode.enhancers <- lapply(encode.enhancers, function(x) {
  x <- read_tsv(x, col_names = FALSE)
  x$X5 <- nrow(x):1
  x <- GRanges(seqnames = x$X1,
               ranges = IRanges(start = x$X2,
                                end = x$X3),
               score = x$X5,
               seqinfo = Seqinfo(genome = "mm10"))
  return(x)
})
encode.enhancers$heart

REPTILE Enhancer Predictions

Regulatory element prediction based on tissue-specific local epigenetic marks (REPTILE) is described in Improved regulatory element prediction based on tissue-specific local epigenomic signatures, and integrates histone modification and whole-genome cytosine DNA methylation profiles to identify the precise location of enhancers. This paper also includes results for DELTA, RFECS, and CSI-ANN which we will be compairing against.

reptile.enhancers <- c("neural tube" = "NT", 
                       midbrain      = "MB", 
                       hindbrain     = "HB",
                       limb          = "LM",
                       heart         = "HT")
reptile.enhancers <- lapply(reptile.enhancers, function(x) {
  file.path <- file.path("statepaintr_data", "enhancer_predictions", "REPTILE", paste0("REPTILE_pred_E11_5_", x, ".bed"))
  x <- read_tsv(file.path, col_names = FALSE)
  x <- GRanges(seqnames = x$X1,
               ranges = IRanges(start = x$X2,
                                end = x$X3),
               score = x$X5,
               enhancername = x$X4)
  return(x)

})
reptile.enhancers$heart

DELTA Enhancer Predictions

DELTA (Distal Enhancer Locating Tool based on AdaBoost) is described in DELTA: A Distal Enhancer Locating Tool Based on AdaBoost Algorithm and Shape Features of Chromatin Modifications and defines a set of non-redundant shape features of histone modifications, which shows high consistency across cell types and can greatly reduce the dimensionality of feature vectors which is then integrated with a machine-learning algorithm AdaBoost to predict enhancers.

delta.enhancers <- c("neural tube" = "NT", 
                       midbrain      = "MB", 
                       hindbrain     = "HB",
                       limb          = "LM",
                       heart         = "HT")
delta.enhancers <- lapply(delta.enhancers, function(x) {
  file.path <- file.path("statepaintr_data", "enhancer_predictions", "DELTA", paste0("DELTA_pred_E11_5_", x, ".bed"))
  x <- read_tsv(file.path, col_names = FALSE)
  x <- GRanges(seqnames = x$X1,
               ranges = IRanges(start = x$X2,
                                end = x$X3),
               score = x$X5,
               enhancername = x$X4)
  return(x)

})
delta.enhancers$heart

RFECS Enhancer Predictions

RFECS (Random Forest based Enhancer identification from Chromatin States) is described in RFECS: A Random-Forest Based Algorithm for Enhancer Identification from Chromatin State and is used to predict genome-wide enhancers based on their similarity to the histone modification profiles of p300 binding sites.

rfecs.enhancers <- c("neural tube" = "NT", 
                       midbrain      = "MB", 
                       hindbrain     = "HB",
                       limb          = "LM",
                       heart         = "HT")
rfecs.enhancers <- lapply(rfecs.enhancers, function(x) {
  file.path <- file.path("statepaintr_data", "enhancer_predictions", "RFECS", paste0("RFECS_pred_E11_5_", x, ".bed"))
  x <- read_tsv(file.path, col_names = FALSE)
  x <- GRanges(seqnames = x$X1,
               ranges = IRanges(start = x$X2,
                                end = x$X3),
               score = x$X5,
               enhancername = x$X4)
  return(x)

})
rfecs.enhancers$heart

CSI-ANN Enhancer Predictions

CSI-ANN (chromatin signature identification by artificial neural network) is described in Discover regulatory DNA elements using chromatin signatures and artificial neural network and is a framework that consists of a data transformation and a feature extraction step followed by a classification step using time-delay neural network.

csiann.enhancers <- c("neural tube" = "NT", 
                       midbrain      = "MB", 
                       hindbrain     = "HB",
                       limb          = "LM",
                       heart         = "HT")
csiann.enhancers <- lapply(csiann.enhancers, function(x) {
  file.path <- file.path("statepaintr_data", "enhancer_predictions", "CSIANN", paste0("CSIANN_pred_E11_5_", x, ".bed"))
  x <- read_tsv(file.path, col_names = FALSE)
  x <- GRanges(seqnames = x$X1,
               ranges = IRanges(start = x$X2,
                                end = x$X3),
               score = x$X5,
               enhancername = x$X4)
  return(x)

})
csiann.enhancers$heart

Comparing External Enhancer Data Sets to Vista

As with StatePaintR compare this data to the VISTA dataset.

ENCODE v3 Enhancer-like regions

plot.data.encode <- PRG(states = encode.enhancers,
                        comparison = vista.enhancers,
                        comparison.select = test.enhancers)
plot.data.encode$auprg

REPTILE

plot.data.reptile <- PRG(states = reptile.enhancers,
                        comparison = vista.enhancers,
                        comparison.select = test.enhancers)
plot.data.reptile$auprg

DELTA

plot.data.delta <- PRG(states = delta.enhancers,
                       comparison = vista.enhancers,
                       comparison.select = test.enhancers)
plot.data.delta$auprg

RFECS

plot.data.rfecs <- PRG(states = rfecs.enhancers,
                       comparison = vista.enhancers,
                       comparison.select = test.enhancers)
plot.data.rfecs$auprg

CSI-ANN

plot.data.csiann <- PRG(states = csiann.enhancers,
                        comparison = vista.enhancers,
                        comparison.select = test.enhancers)
plot.data.csiann$auprg

Additional Examples can be found in the supplemental Rmarkdown file on StateHub

Visualization of Predictions of VISTA Validated enhancers

We begin by doing a little bit of cleaning up the data to prepare for plotting with ggplot2

plot.data.encode$precision.recall.gain$SOURCE <- "ENCODE"
plot.data.sp$precision.recall.gain$SOURCE <- "StatePaintR"
plot.data.reptile$precision.recall.gain$SOURCE <- "REPTILE"
plot.data.delta$precision.recall.gain$SOURCE <- "DELTA"
plot.data.rfecs$precision.recall.gain$SOURCE <- "RFECS"
plot.data.csiann$precision.recall.gain$SOURCE <- "CSIANN"

precision.recall.gain <- rbind.data.frame(plot.data.encode$precision.recall.gain, 
                                          plot.data.sp$precision.recall.gain,
                                          plot.data.reptile$precision.recall.gain,
                                          plot.data.delta$precision.recall.gain,
                                          plot.data.rfecs$precision.recall.gain,
                                          plot.data.csiann$precision.recall.gain)

plot.data.encode$convex.hull$SOURCE <- "ENCODE"
plot.data.sp$convex.hull$SOURCE <- "StatePaintR"
plot.data.reptile$convex.hull$SOURCE <- "REPTILE"
plot.data.delta$convex.hull$SOURCE <- "DELTA"
plot.data.rfecs$convex.hull$SOURCE <- "RFECS"
plot.data.csiann$convex.hull$SOURCE <- "CSIANN"
convex.hull <- rbind.data.frame(plot.data.encode$convex.hull, 
                                plot.data.sp$convex.hull,
                                plot.data.reptile$convex.hull,
                                plot.data.delta$convex.hull,
                                plot.data.rfecs$convex.hull,
                                plot.data.csiann$convex.hull)

comboplot <- ggplot(precision.recall.gain, aes(y = PRECISION, x = RECALL, group = SOURCE)) +
  geom_line(aes(color = SOURCE)) +
  geom_point(aes(color = SOURCE)) +
  coord_cartesian(xlim=c(0,1), ylim = c(0.4,1)) +
  scale_color_brewer(palette = "Dark2") +
  geom_line(data = convex.hull, aes(y = PRECISION, 
                                    x = RECALL, 
                                    group = SOURCE, 
                                    color = SOURCE), linetype = 2) +
  theme_grey() +
  ylab("Precision Gain") + xlab("Recall Gain") + theme(aspect.ratio=1) +
  facet_wrap( ~ TISSUE, ncol = 3)
comboplot

Area Under the Precision-Recall-Gain Curve

From all of this analysis we generate the Area Under the Precision-Recall-Gain Curve (AUPRG) which conveys an expected F1 score on a harmonic scale.

auprg <- data.frame(source = c("ENCODE", "StatePaintR", "REPTILE", "DELTA", "RFECS", "CSIANN"),
                    "neural tube" = NA,
                    midbrain = NA,
                    hindbrain = NA,
                    limb = NA,
                    heart = NA,
                    check.names = FALSE)

auprg[auprg$source == "ENCODE", names(plot.data.encode$auprg)] <- plot.data.encode$auprg
auprg[auprg$source == "REPTILE", names(plot.data.reptile$auprg)] <- plot.data.reptile$auprg
auprg[auprg$source == "DELTA", names(plot.data.delta$auprg)] <- plot.data.delta$auprg
auprg[auprg$source == "RFECS", names(plot.data.rfecs$auprg)] <- plot.data.rfecs$auprg
auprg[auprg$source == "CSIANN", names(plot.data.csiann$auprg)] <- plot.data.csiann$auprg
auprg[auprg$source == "StatePaintR", names(plot.data.sp$auprg)] <- plot.data.sp$auprg

auprg$ave_auprg <- rowSums(auprg[, -1], na.rm = TRUE)/5
average_rank <- sapply(auprg[, c("neural tube", 
                                 "midbrain", 
                                 "hindbrain", 
                                 "limb", 
                                 "heart")], function(x) {rank(1-x)})
auprg$ave_rank <- sapply(split(average_rank, 1:nrow(average_rank)), mean)
pauprg <- auprg
pauprg[, -1] <- signif(pauprg[, -1], digits = 2)
knitr::kable(pauprg)

Example of model comparisons.

Comparing Models

Enrichment calculations were done using either of two different state models (Model 1 and Model 2) from StateHub, "Default" and "Focused Poised Promoter", which differ in the treatment of poised promoters. Each plot is made using the same y axis range for comparison and emphasizes that one model is clearly more selective than the other. Both models clearly detect enrichment of hypermethylated probes in the poised state. Model 2 is more selective than model 1 in its definition of poised promoter.

enrichment.out <- read_tsv("statepaintr_data/methylation.enrichment.txt")
enrichment.out <- enrichment.out[enrichment.out$type != "ENCODE2012", ]
head(enrichment.out)

Model 1 - Default Model

In Model 1, we assign any promoters lacking active marks to the poised state.

model1.plot <- ggplot(enrichment.out[enrichment.out$model == "hyper1", ], aes(name, oddsratio, group = color)) +
  geom_pointrange(aes(ymin = odds.lower, ymax = odds.upper, color = color), fatten = 1, size = 1.2) +
  theme_minimal() +
  scale_color_identity() +
  geom_hline(yintercept = 1, color = "#000000", alpha = 0.1) +
  theme(axis.text.x = element_blank(),
        legend.position="none",
        panel.grid.major.x = element_blank(), panel.grid.minor = element_blank(), panel.background = element_blank(),
        strip.text.y = element_text(angle = 0, hjust = 0, vjust = 0.5, size = rel(1.5)),
        strip.text.x = element_text(angle = 270, hjust = 0.5, vjust = 1, size = rel(1.5))) +
  scale_x_discrete(name = "Sample") +
  scale_y_continuous(name = "Odds Ratio") +
  coord_cartesian(ylim = c(0,12)) +
  facet_grid(state ~ type, scales = "free_x", space = "free_x", switch = "x") +
  annotate("rect", xmin=-Inf, xmax=Inf, ymin=-Inf, ymax=1,alpha=0.1,fill="black")
model1.plot

Model 2 - Focused Poised Promoter

In this model, enhancers with H3K4me1 and promoters with H3K4me3 overlapping narrow regions of H3K27me3 are called poised (EPR and PPR), but those without H3K27me3 are called weak (EWR and PWR)

model2.plot <- ggplot(enrichment.out[enrichment.out$model == "hyper2", ], aes(name, oddsratio, group = color)) +
  geom_pointrange(aes(ymin = odds.lower, ymax = odds.upper, color = color), fatten = 1, size = 1.2) +
  theme_minimal() +
  scale_color_identity() +
  geom_hline(yintercept = 1, color = "#000000", alpha = 0.1) +
  theme(axis.text.x = element_blank(),
        legend.position="none",
        panel.grid.major.x = element_blank(), panel.grid.minor = element_blank(), panel.background = element_blank(),
        strip.text.y = element_text(angle = 0, hjust = 0, vjust = 0.5, size = rel(1.5)),
        strip.text.x = element_text(angle = 270, hjust = 0.5, vjust = 1, size = rel(1.5))) +
  scale_x_discrete(name = "Sample") +
  scale_y_continuous(name = "Odds Ratio") +
  coord_cartesian(ylim = c(0,12)) +
  facet_grid(state ~ type, scales = "free_x", space = "free_x", switch = "x") +
  annotate("rect", xmin=-Inf, xmax=Inf, ymin=-Inf, ymax=1,alpha=0.1,fill="black")
model2.plot

Clean Up

This file was last modified on April 18, 2017

sessionInfo()
unlink("statepaintr_data", recursive = TRUE)


Simon-Coetzee/footprintR documentation built on May 9, 2019, 1:31 p.m.