In Bioconductor/HMMRATAC: HMMRATAC - a semi-supervised machine learning approach for identifying open chromatin regions from ATAC-Seq data using Hidden Markov Models

suppressPackageStartupMessages({
  library(ATACseqQC)
  library(ChIPpeakAnno)
  library(BSgenome.Hsapiens.UCSC.hg19)
  library(TxDb.Hsapiens.UCSC.hg19.knownGene)
  library(MotifDb)
  library(GenomicAlignments)
})
knitr::opts_chunk$set(warning=FALSE, message=FALSE)

Introduction

Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) is a method to identify open chromatin regions in a genome-wide manner. This method involves using a transposase to cut sections of accessable DNA selectively [@buenostro2013transposition]. ATAC-seq identifies regions highly corrrelated with similar methods used to identify these regions, DNase-seq [@boyle2008high] and FAIRE-seq [@giresi2007faire]. ATAC-seq has become a popular method for this task because it is faster and easier to perform, has a higher signal to noise ration, and does not require as large of an amount of biological samples when comapred with other methods [@buenostro2013transposition].

Hidden Markov ModeleR for ATAC-seq (HMMRATAC) is a semi-supervised machine learning appraoch for identifying open chromatin regions from ATAC-seq data. The idea behind HMMRATAC is conceptualized upon the idea of "decomposition and integration. The ATAC-seq data is decomposed into different layers of coverage signals corresponding to the sequnced DNA fragments originated from nucleosomal regions. The relationship between the layers of signals at open chromatin regions is then learned using a Hidden Markov Model and utilized for predicting open chromatin. HMMRATAC is capable of outperforming similar methods in identifying chromatin structure and transcription factor binding sites. [@hmmratac2019tarbell]

A typical analysis pipeline begins with aligning sequencing reads to a reference genome, then using HMMRATAC to identify of accessible regions or "peaks" in the chromatin. Next motif enrichment can be performed with MEME, footprint identificaiton with CENTIPEDE, or differential analysis using Diffbind. Quality control measurments can take place after each of these steps using ATACseqQC.

This Bioconductor package has been made to utilize the original Java program for HMMRATAC and make it availiable for use within Bioconductor.

Installation

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install('HMMRATAC')

Example Pipeline

The rest of this vignette will detail how to use this package and others for the purpose of performing several types analysis of on an ATACSeq sample to confirm some experimental results.

Our example stems from the article "β-Glucan Reverses the Epigenetic State of LPS-Induced Immunological Tolerance" [@novakovic2016β-glucan]. In this article, it can be understand that monocytes undergo functional reprogramming as apart of the innate immune response after exposure to microbial lipopolysaccharide (LPS). LPS-treated monocytes fail to accumulate active histone markers in promoters ad enhancers of lipid metabolism and phagocytic pathways. In contrast, β-glucan reverses the LPS-induced tolerance and reinduces their capacity for cytokine production.

An integrated epigenomic apprach is envoked to characterize the molecular events involved with LPS-induced tolerance in a time-dependant manner. Time-dependant ATACseq was performed on naive macrophages, LPS-exposed tolerized macrophages, and BG-exposed trained macrophages. Measurements were taken at one hour, 4 hour, 24 hour, and 6 day time frames.

This ATACseq data is avaible from the Short Reads Archieve (SRA)

In this example pipeline, we seek to validate the findings of the article by performing transcription factor and functional analysis with the following steps: - Sequence alignment of reads using Bowtie2 - Use of Rsamtools to generate bam files and file necessary for peak-calling - Quality control using ATACseqQC - Using HMMRATAC for peak-calling - ChIPSeekeR - Using diffbind to perform differential binding ananlysis between the treatments - A pipeline to identify significant transcription factors using MotifDb and TFBSTools - A pipeline for functional enrichment analysis using rGREAT - gViz

Due to the size of the files and the complexity of computations being performed, most of the functionality of the vignette is not meant to be run by the user in the context of an example. Instead, the results of the computations will be included at the end of their section.

Pleae load the data from the HMMRATACData package for these computations

library(HMMRATACData)
data(HMMRATACData)

Sequence Alignment Using Bowtie

We begin with a BAM file containing pair-end ATAC-seq reads. Normally, this alignment is performed by a sequence aligner such as BOWTIE2 or BWA, but for the scope of this workflow, we will simply give an example of how to run bowtie on pairend count data like what can be obtained from the SRA link above. The vignette does not require that the reader performs this step. The ${SRR} in this example bach command corresponds to the name of a sample. Information on the use of bowtie can be found here

bowtie --chunkmbs 8000 -S -t hg19 -1 ${SRR}.1.fastq -2 ${SRR}.2.fastq ${SRR}.sam

Generate necessary files using RSamtools

The Rsamtools package provides a range of functionality to read and manipulate bam files. Rsamtools is the Bioconductor implementation of Samtools.

Using the BAM file that we just loaded, we will generate the remaining two files needed to run HMMRATAC using Rsamtools. Our next step is to create an index file. Rsamtools can do this with a sorted BAM file. We generate an index file as follow:

library(Rsamtools)

## Generate sorted bam file and save to a temporary location.
bam <- sortBam(file = bam, destination = tempfile())

## Create the index file.
index <- indexBam(file = bam)

The last file that we will generate is the genome.info file. This files shows

## Get list of headers.
genome <- scanBamHeader(file = bam)

## Extract targets and filter list
targets <- genome[[bam]]$targets
idx <- grepl("\\.[0-9]+$", names(targets))
genome_info <- targets[!idx]
genome_info <- data.frame(genome_info)

## Write table.
genome <- tempfile()
write.table(genome_info, genome, col.names = FALSE, quote = FALSE)

ATACseqQC

ATACseqQC is a package used for quickly assesing the quality of ATAC-seq data. It specializes in the use of

After performing the initial alignment, the quality of the alignment can be assesed using ATACseqQC

library(ATACseqQC)
estimateLibComplexity(readsDupFreq(bam))

fragSize <- fragSizeDist(bam, "Frag Sizes")

Footprint identification can also be performed with ATACseqQC.

library(BSgenome.Hsapiens.UCSC.hg19)
library(MotifDb)
CTCF <- query(MotifDb, c("CTCF"))
CTCF <- as.list(CTCF)
seqlev <- "chr1"

sigs <- factorFootprints(bam, pfm=CTCF[[1]], 
                         genome=Hsapiens,
                         min.score="90%", seqlev=seqlev,
                         upstream=100, downstream=100)

featureAlignedHeatmap(sigs$signal, 
                      feature.gr=reCenterPeaks(sigs$bindingSites,
                                               width=200+width(sigs$bindingSites[1])), 
                      annoMcols="score",
                      sortBy="score",
                      n.tile=ncol(sigs$signal[[1]]))

Using HMMRATAC for Peak Calling

HMMRATAC requires a large allocation of memory to run that rJava is not able to initially allocate to it. The amount of memory needed varies by the size of the data being modeled. For this example, we allocate 8GB of memory before loading the HMMRATAC package:

options(java.parameters = "-Xmx8000m")
library(HMMRATAC)

Using the original bam file as well as the index and genome info files that we generated, we can run the HMMRATAC package's main function.

HMMRATAC(bam = bam, index = index, genome = genome)

HMMRATAC output files

The following files are generated in an output folder from a run of HMMRATAC:

Name.log

This is a log file generated as HMMRATAC runs. It contains all inputted arguments, the results of the fragment distribution EM algorithm, and various status updates as HMMRATAC progresses, as well as a textual description of the generated model. When finished, it also reports the total amount of actual (real-world) time it took for HMMRATAC to run.

Name.bedgraph

This is the genome-wide state annotation file created if the --bedgraph option is set to true. It is a four column, tab-delimited file, with the following fields:

Chromosome Name
Region Start (zero-based)
Region Stop (zero-based)
State annotation. This represents what state the particular region is assigned to, corresponding to the created model. The state name is prefixed with the letter “E”, to make extractions easier. For instance, to retrieve all regions that were classified as state 1, type:
grep E1 Name.bedgraph > State1_regions.bed

Name.model

This file contains the model that was generated by HMMRATAC and was used to decode the genome. It is always recommended to check this model after running HMMRATAC. Please note that this file is a binary file that can only be read by HMMRATAC. To see the actual textual description of the model, see the log file. If HMMRATAC runs to completion but produces poor results, it is usually the result of a faulty model. For instance, if the model shows “NaN” for all of the parameters, such as transition or emission probabilities, this usually means that something went wrong with either the choice of training regions or generating the signal tracks. This can occur with too low or too high of a fold-change range for choosing training regions, if the BED file of training regions was faulty or if certain signals need to be trimmed off (due to small numbers of larger fragments). Generally speaking, this poor model occurs when there is not enough separation in the signal tracks between the states. Common fixes include, trimming the signal tracks (option --trim), changing the number of states (option -k), choosing a different fold change range for training site identification (option -u and -l) or changing the list of previously annotated training regions (option -t). Additionally, checking the model ensures that the predicted model is created. Although rare, it is possible for a different state to be better indicative of the open state, than is normally the case (HMMRATAC sorts the model so the last or highest numbered state should be the open state). If this happens, then the resulting peak file may be incorrect. In such cases, it is recommended to report a bedgraph file and extract the proper state (of certain lengths and with the accompanying flanking regions) and manually create a peak BED file. We have tested numerous ATAC-seq datasets, using our default parameters and we generally don’t encounter problems, unless we radically change those settings.

Name_peaks.gappedPeak

This is the standard peak file reported by HMMRATAC by default (when the -p , --peaks option is set to true). The first line in the file is a track line for genome browsers. After that, it is a 15 column, tab-delimited file containg the following fields:

Chromosome Name
Peak Start (zero-based). This is the beginning of the regulatory region that HMMRATAC identifies as a peak. This includes the flanking nucleosome regions. Therefore, this position represents the start of the upstream nucleosome. If there is no upstream nucleosome, this position represents the start of the open state.
Peak Stop (zero-based). This is the end of the regulatory region that HMMRATAC identifies as a peak. This includes the flanking nucleosome regions. Therefore, this position represents the end of the downstream nucleosome. If there is no downstream nucleosome, this position represents the end of the open state.
Peak Name. Unique name for the peak, in the format “Peak_#”. For the high coverage regions that are excluded from Viterbi decoding (described in Section 2, under the -z , --zscore option), the name is in the format “HighCoveragePeak_#”. This allows the high coverage peaks to be easily identified and extracted.
Unused, denoted with “.”
Unused, denoted with “.”
Open state start (zero-based). This is the genomic position where the open state begins. It is therefore possible to use only the open regions, rather than the regulatory regions that HMMRATAC reports by default.
Open state stop (zero-based). This is the genomic position where the open state ends. It is therefore possible to use only the open regions, rather than the regulatory regions that HMMRATAC reports by default.
Color code for display. Set to 255,0,0. This would create a dark-blue color on a genome browser.
The number of sub-regions in the peak region. For standard peaks this is set to 3, indicating the open state region and the two flanking nucleosome state regions. For high coverage peaks, this is set to 1. Peaks that don’t have upstream and/or downstream nucleosomes will have different values as well.
Comma separated list of sub-region lengths. The first and last values are always set to one, making visualization easier. The middle value is the length of the open state region. For high coverage peaks, this is only a single value, denoting the length of the total region. If the peak lacks upstream and/or downstream nucleosomes, these values will reflect that.
Comma separated list of the (zero-based) starts of each sub-region, relative to the total region’s start. For instance, the first value is always 0, meaning 0 bp from the region start (column 2). The second value is the distance from the region start (column 2) to the open region start (column 7), etc.
Peak Score. This is the score for the peak, as determined by the scoring system option described in section 2 (option --score).
Unused, denoted as -1.
Unused, denoted as -1.

Name_summits.bed

This is a BED file containing the summits of each HMMRATAC peak. Summits are determined by finding the position within the open state region, whose Gaussian-smoothed read coverage is the maximum over the entire region. This is only outputted if a peak file is also outputted. If motif analysis is being conducted, it is recommended to use these summits as the points of interest, as the summits tend to be better indicative of TF binding as opposed to peak centers. This is a four column, tab-delimited BED file containing the following fields:

Chromosome Name
Summit Start (zero-based)
Summit Stop (zero-based)
Summit Name. This is the same name as the corresponding Peak name used in the peak file (column 4 of the Name_peaks.gappedPeak file).
Peak Score. Same score as in the .gappedPeak file.

DiffBind

The DiffBind package specializes in identifying sites that are differentially bound between two sample groups. It also contains functionality for processing peak sets, counting sequencing reads overlapping interals in peak sets, and identifying statistically significant differentially bound sites based on evidence of binding affinity.

Here, we create a sampleSheet that describes the treatments and provide it as input.

library(DiffBind)

dba <- dba(sampleSheet='sample_sheet.csv')
count <- dba.count(dba, summits=500)
contrast <- dba.contrast(count, categories=DBA_CONDITION)
analyze <- dba.analyze(contrast)
bed <- dba.report(analyze)

TFBS analysis

In this section, we analysis the enrichment of transcrption factors in our dataset. First, the JASPAR2018 database is accessed to find pwm information for humans. Then a hypergeometric test if performed.

library(MotifDb)
library(TFBSTools)
library(JASPAR2018)
library("BSgenome.Hsapiens.UCSC.hg19")
library("TxDb.Hsapiens.UCSC.hg19.knownGene")
library(GenomicFeatures)

## Step 1: Match JASPAR TFs to chromosomes
opts <- list()
opts['species'] <- '9606'
pfm <- getMatrixSet(JASPAR2018, opts)
pwm <- toPWM(pfm)
#pwm <- pwm[1:10]

bs <- BSgenome.Hsapiens.UCSC.hg19
chrs <- lapply(1:22, function(i) {
    bs[[paste0('chr', i)]]
})

pro <- promoters(TxDb.Hsapiens.UCSC.hg19.knownGene)

## Step 2: Perform ORA to find signficant TFBS's
run_seq3 <- function(seqname, motif_id, pwms, bsgenome, peaks, promoters) {
    message("Motif ID: ", motif_id, " -- Seqname: ", seqname)
    ## Find binding sites on chromosome
    siteseq <- searchSeq(pwms[motif_id], bsgenome[[seqname]], seqname = seqname, min.score="80%", strand = "*")
    tfbs <- as(siteseq, "GRanges")
    promoters_chr <- promoters[seqnames(promoters) == seqname]
    peaks_chr <- peaks[seqnames(peaks) == seqname]

    ## Get promoters with TF
    idx <- promoters_chr %over% tfbs
    promoters_with_tf <- promoters_chr[idx]
    promoters_without_tf <- promoters_chr[!idx]

    ## Summary statistics for phyper
    npromoters_with_tf <- sum(countOverlaps(promoters_chr, tfbs) > 0)
    npromoters_without_tf <- length(promoters_chr) - npromoters_with_tf

    npromoters_with_tf_in_open_chromatin <- sum(countOverlaps(peaks_chr, promoters_with_tf) > 0)
    npromoters_without_tf_in_open_chromatin <- sum(countOverlaps(peaks_chr, promoters_without_tf) > 0)

    ## Return values
    c(q = npromoters_with_tf_in_open_chromatin,
      m = npromoters_with_tf,
      n = npromoters_without_tf,
      k = npromoters_with_tf_in_open_chromatin + npromoters_without_tf_in_open_chromatin)
}

perform_run_by_motif <- function(motif_id, pwm, bs, peaks, pro) {
    message('Running Motif: ', motif_id)
    seqnames <- paste0('chr', 1:22)
    motif_result <- lapply(seqnames, run_seq3, motif_id = motif_id, pwms = pwm, bsgenome = bs, peaks = peaks, promoters = pro)
    red <- Reduce(`+`, motif_result)

perform_run_by_motif <- function(motif_id, pwm, bs, peaks, pro) {
    message('Running Motif: ', motif_id)
    seqnames <- paste0('chr', 1:22)
    motif_result <- lapply(seqnames, run_seq3, motif_id = motif_id, pwms = pwm, bsgenome = bs, peaks = peaks, promoters = pro)
    red <- Reduce(`+`, motif_result)
    do.call(phyper, as.list(red))
}

run_motif_analysis <- function(peak_file_name, output_file_name) {
    motif_names <- names(pwm)
    motif_symbols <- vapply(pwm, function(motif) {motif@name}, character(1))
    total_results <- lapply(motif_names, perform_run_by_motif, pwm = pwm, bs = bs, peaks = peaks, pro = pro)
    total_results <- matrix(total_results)
    rownames(total_results) <- motif_symbols
    write.table(total_results, file = output_file_name, col.names = TRUE, quote=FALSE)
}

peak_LPS <- system.file("extdata", "sigs_LPS.bed", package="HMMRATAC")
peak_BG <- system.file("extdata", "sigs_BG.bed", package="HMMRATAC")
run_motif_analysis(peaks_LPS, "sig_motifs_LPS.txt")
run_motif_analysis(peals_BG, "sig_motifs_BG.txt")

The pre-computed results from the the HMMRATACData package are available. The significant transcription factors can be found as follows.

head(bg_motifs)
head(lps_motifs)

rGREAT functional analysis

rGREAT is a package that can read peak files and infer which pathways are enriched using functional enrichment analysis

library(rGREAT)
job <- submitGreatJob(bed)
tb <- getEnrichmentTables(job)
look <- tb[[1]]
res <- look[look$Hyper_Adjp_BH < 0.05,]

The pre-computed results from the the HMMRATACData package are available. The significant pathways can be found as follows.

head(bg_pathways)
head(lps_pathways)

ChIPseeker

The ChIPseeker package is meant to perform statistical testing of significant overlap amoung ChIP seq data sets. It also incorporates open access database GEO to allow users to compare their dataset to those availiable on GEO.

peak <- readPeakFile('65283_chr22_peaks.gappedPeak')

Display peak locations across entire genome

covplot(peak, weightCol="weights")

Perform peak annotation

peakAnno <- annotatePeak(files[[4]], tssRegion=c(-3000, 3000),
                         TxDb=txdb, annoDb="org.Hs.eg.db")

plotAnnoPie(peakAnno)

upsetplot(peakAnno, vennpie=TRUE)

plotDistToTSS(peakAnno,
              title="Distribution of transcription factor-binding loci relative to TSS")

Session Info

sessionInfo()

Bioconductor/HMMRATAC documentation built on July 31, 2020, 3:07 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Bioconductor/HMMRATAC
HMMRATAC - a semi-supervised machine learning approach for identifying open chromatin regions from ATAC-Seq data using Hidden Markov Models

In Bioconductor/HMMRATAC: HMMRATAC - a semi-supervised machine learning approach for identifying open chromatin regions from ATAC-Seq data using Hidden Markov Models

Introduction

Installation

Example Pipeline

Sequence Alignment Using Bowtie

Generate necessary files using RSamtools

ATACseqQC

Using HMMRATAC for Peak Calling

HMMRATAC output files

Name.log

Name.bedgraph

Name.model

Name_peaks.gappedPeak

Name_summits.bed

DiffBind

TFBS analysis

rGREAT functional analysis

ChIPseeker

Session Info

R Package Documentation

Browse R Packages

We want your feedback!

Bioconductor/HMMRATAC HMMRATAC - a semi-supervised machine learning approach for identifying open chromatin regions from ATAC-Seq data using Hidden Markov Models

In Bioconductor/HMMRATAC: HMMRATAC - a semi-supervised machine learning approach for identifying open chromatin regions from ATAC-Seq data using Hidden Markov Models

Introduction

Installation

Example Pipeline

Sequence Alignment Using Bowtie

Generate necessary files using RSamtools

ATACseqQC

Using HMMRATAC for Peak Calling

HMMRATAC output files

Name.log

Name.bedgraph

Name.model

Name_peaks.gappedPeak

Name_summits.bed

DiffBind

TFBS analysis

rGREAT functional analysis

ChIPseeker

Session Info

R Package Documentation

Browse R Packages

We want your feedback!

Bioconductor/HMMRATAC
HMMRATAC - a semi-supervised machine learning approach for identifying open chromatin regions from ATAC-Seq data using Hidden Markov Models