suppressPackageStartupMessages({ library(ATACseqQC) library(ChIPpeakAnno) library(BSgenome.Hsapiens.UCSC.hg19) library(TxDb.Hsapiens.UCSC.hg19.knownGene) library(MotifDb) library(GenomicAlignments) }) knitr::opts_chunk$set(warning=FALSE, message=FALSE)
Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) is a method to identify open chromatin regions in a genome-wide manner. This method involves using a transposase to cut sections of accessable DNA selectively [@buenostro2013transposition]. ATAC-seq identifies regions highly corrrelated with similar methods used to identify these regions, DNase-seq [@boyle2008high] and FAIRE-seq [@giresi2007faire]. ATAC-seq has become a popular method for this task because it is faster and easier to perform, has a higher signal to noise ration, and does not require as large of an amount of biological samples when comapred with other methods [@buenostro2013transposition].
Hidden Markov ModeleR for ATAC-seq (HMMRATAC) is a semi-supervised machine learning appraoch for identifying open chromatin regions from ATAC-seq data. The idea behind HMMRATAC is conceptualized upon the idea of "decomposition and integration. The ATAC-seq data is decomposed into different layers of coverage signals corresponding to the sequnced DNA fragments originated from nucleosomal regions. The relationship between the layers of signals at open chromatin regions is then learned using a Hidden Markov Model and utilized for predicting open chromatin. HMMRATAC is capable of outperforming similar methods in identifying chromatin structure and transcription factor binding sites. [@hmmratac2019tarbell]
A typical analysis pipeline begins with aligning sequencing reads to a reference genome, then using HMMRATAC to identify of accessible regions or "peaks" in the chromatin. Next motif enrichment can be performed with MEME, footprint identificaiton with CENTIPEDE, or differential analysis using Diffbind. Quality control measurments can take place after each of these steps using ATACseqQC.
This Bioconductor package has been made to utilize the original Java program for HMMRATAC and make it availiable for use within Bioconductor.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install('HMMRATAC')
The rest of this vignette will detail how to use this package and others for the purpose of performing several types analysis of on an ATACSeq sample to confirm some experimental results.
Our example stems from the article "β-Glucan Reverses the Epigenetic State of LPS-Induced Immunological Tolerance" [@novakovic2016β-glucan]. In this article, it can be understand that monocytes undergo functional reprogramming as apart of the innate immune response after exposure to microbial lipopolysaccharide (LPS). LPS-treated monocytes fail to accumulate active histone markers in promoters ad enhancers of lipid metabolism and phagocytic pathways. In contrast, β-glucan reverses the LPS-induced tolerance and reinduces their capacity for cytokine production.
An integrated epigenomic apprach is envoked to characterize the molecular events involved with LPS-induced tolerance in a time-dependant manner. Time-dependant ATACseq was performed on naive macrophages, LPS-exposed tolerized macrophages, and BG-exposed trained macrophages. Measurements were taken at one hour, 4 hour, 24 hour, and 6 day time frames.
This ATACseq data is avaible from the Short Reads Archieve (SRA)
In this example pipeline, we seek to validate the findings of the article by performing transcription factor and functional analysis with the following steps: - Sequence alignment of reads using Bowtie2 - Use of Rsamtools to generate bam files and file necessary for peak-calling - Quality control using ATACseqQC - Using HMMRATAC for peak-calling - ChIPSeekeR - Using diffbind to perform differential binding ananlysis between the treatments - A pipeline to identify significant transcription factors using MotifDb and TFBSTools - A pipeline for functional enrichment analysis using rGREAT - gViz
Due to the size of the files and the complexity of computations being performed, most of the functionality of the vignette is not meant to be run by the user in the context of an example. Instead, the results of the computations will be included at the end of their section.
Pleae load the data from the HMMRATACData
package for these computations
library(HMMRATACData) data(HMMRATACData)
We begin with a BAM file containing pair-end ATAC-seq reads. Normally, this
alignment is performed by a sequence aligner such as BOWTIE2 or BWA, but for the
scope of this workflow, we will simply give an example of how to run bowtie on
pairend count data like what can be obtained from the SRA link above. The
vignette does not require that the reader performs this step.
The ${SRR}
in this example bach command corresponds to the name of a sample.
Information on the use of bowtie can be found here
bowtie --chunkmbs 8000 -S -t hg19 -1 ${SRR}.1.fastq -2 ${SRR}.2.fastq ${SRR}.sam
The Rsamtools package provides a range of functionality to read and manipulate bam files. Rsamtools is the Bioconductor implementation of Samtools.
Using the BAM file that we just loaded, we will generate the remaining two files needed to run HMMRATAC using Rsamtools. Our next step is to create an index file. Rsamtools can do this with a sorted BAM file. We generate an index file as follow:
library(Rsamtools) ## Generate sorted bam file and save to a temporary location. bam <- sortBam(file = bam, destination = tempfile()) ## Create the index file. index <- indexBam(file = bam)
The last file that we will generate is the genome.info
file. This files shows
## Get list of headers. genome <- scanBamHeader(file = bam) ## Extract targets and filter list targets <- genome[[bam]]$targets idx <- grepl("\\.[0-9]+$", names(targets)) genome_info <- targets[!idx] genome_info <- data.frame(genome_info) ## Write table. genome <- tempfile() write.table(genome_info, genome, col.names = FALSE, quote = FALSE)
ATACseqQC is a package used for quickly assesing the quality of ATAC-seq data. It specializes in the use of
After performing the initial alignment, the quality of the alignment can be assesed using ATACseqQC
library(ATACseqQC) estimateLibComplexity(readsDupFreq(bam))
fragSize <- fragSizeDist(bam, "Frag Sizes")
Footprint identification can also be performed with ATACseqQC.
library(BSgenome.Hsapiens.UCSC.hg19) library(MotifDb) CTCF <- query(MotifDb, c("CTCF")) CTCF <- as.list(CTCF) seqlev <- "chr1" sigs <- factorFootprints(bam, pfm=CTCF[[1]], genome=Hsapiens, min.score="90%", seqlev=seqlev, upstream=100, downstream=100) featureAlignedHeatmap(sigs$signal, feature.gr=reCenterPeaks(sigs$bindingSites, width=200+width(sigs$bindingSites[1])), annoMcols="score", sortBy="score", n.tile=ncol(sigs$signal[[1]]))
HMMRATAC requires a large allocation of memory to run that rJava
is not able to initially allocate to it. The amount of memory needed varies by
the size of the data being modeled. For this example, we allocate 8GB of memory
before loading the HMMRATAC package:
options(java.parameters = "-Xmx8000m") library(HMMRATAC)
Using the original bam file as well as the index and genome info files that we generated, we can run the HMMRATAC package's main function.
HMMRATAC(bam = bam, index = index, genome = genome)
The following files are generated in an output folder from a run of HMMRATAC:
This is a log file generated as HMMRATAC runs. It contains all inputted arguments, the results of the fragment distribution EM algorithm, and various status updates as HMMRATAC progresses, as well as a textual description of the generated model. When finished, it also reports the total amount of actual (real-world) time it took for HMMRATAC to run.
This is the genome-wide state annotation file created if the --bedgraph option is set to true. It is a four column, tab-delimited file, with the following fields:
Chromosome Name
Region Start (zero-based)
Region Stop (zero-based)
State annotation. This represents what state the particular region is assigned to, corresponding to the created model. The state name is prefixed with the letter “E”, to make extractions easier. For instance, to retrieve all regions that were classified as state 1, type:
grep E1 Name.bedgraph > State1_regions.bed
This file contains the model that was generated by HMMRATAC and was used to decode the genome. It is always recommended to check this model after running HMMRATAC. Please note that this file is a binary file that can only be read by HMMRATAC. To see the actual textual description of the model, see the log file. If HMMRATAC runs to completion but produces poor results, it is usually the result of a faulty model. For instance, if the model shows “NaN” for all of the parameters, such as transition or emission probabilities, this usually means that something went wrong with either the choice of training regions or generating the signal tracks. This can occur with too low or too high of a fold-change range for choosing training regions, if the BED file of training regions was faulty or if certain signals need to be trimmed off (due to small numbers of larger fragments). Generally speaking, this poor model occurs when there is not enough separation in the signal tracks between the states. Common fixes include, trimming the signal tracks (option --trim), changing the number of states (option -k), choosing a different fold change range for training site identification (option -u and -l) or changing the list of previously annotated training regions (option -t). Additionally, checking the model ensures that the predicted model is created. Although rare, it is possible for a different state to be better indicative of the open state, than is normally the case (HMMRATAC sorts the model so the last or highest numbered state should be the open state). If this happens, then the resulting peak file may be incorrect. In such cases, it is recommended to report a bedgraph file and extract the proper state (of certain lengths and with the accompanying flanking regions) and manually create a peak BED file. We have tested numerous ATAC-seq datasets, using our default parameters and we generally don’t encounter problems, unless we radically change those settings.
This is the standard peak file reported by HMMRATAC by default (when the -p , --peaks option is set to true). The first line in the file is a track line for genome browsers. After that, it is a 15 column, tab-delimited file containg the following fields:
Chromosome Name
Peak Start (zero-based). This is the beginning of the regulatory region that HMMRATAC identifies as a peak. This includes the flanking nucleosome regions. Therefore, this position represents the start of the upstream nucleosome. If there is no upstream nucleosome, this position represents the start of the open state.
Peak Stop (zero-based). This is the end of the regulatory region that HMMRATAC identifies as a peak. This includes the flanking nucleosome regions. Therefore, this position represents the end of the downstream nucleosome. If there is no downstream nucleosome, this position represents the end of the open state.
Peak Name. Unique name for the peak, in the format “Peak_#”. For the high coverage regions that are excluded from Viterbi decoding (described in Section 2, under the -z , --zscore option), the name is in the format “HighCoveragePeak_#”. This allows the high coverage peaks to be easily identified and extracted.
Unused, denoted with “.”
Unused, denoted with “.”
Open state start (zero-based). This is the genomic position where the open state begins. It is therefore possible to use only the open regions, rather than the regulatory regions that HMMRATAC reports by default.
Open state stop (zero-based). This is the genomic position where the open state ends. It is therefore possible to use only the open regions, rather than the regulatory regions that HMMRATAC reports by default.
Color code for display. Set to 255,0,0. This would create a dark-blue color on a genome browser.
The number of sub-regions in the peak region. For standard peaks this is set to 3, indicating the open state region and the two flanking nucleosome state regions. For high coverage peaks, this is set to 1. Peaks that don’t have upstream and/or downstream nucleosomes will have different values as well.
Comma separated list of sub-region lengths. The first and last values are always set to one, making visualization easier. The middle value is the length of the open state region. For high coverage peaks, this is only a single value, denoting the length of the total region. If the peak lacks upstream and/or downstream nucleosomes, these values will reflect that.
Comma separated list of the (zero-based) starts of each sub-region, relative to the total region’s start. For instance, the first value is always 0, meaning 0 bp from the region start (column 2). The second value is the distance from the region start (column 2) to the open region start (column 7), etc.
Peak Score. This is the score for the peak, as determined by the scoring system option described in section 2 (option --score).
Unused, denoted as -1.
Unused, denoted as -1.
This is a BED file containing the summits of each HMMRATAC peak. Summits are determined by finding the position within the open state region, whose Gaussian-smoothed read coverage is the maximum over the entire region. This is only outputted if a peak file is also outputted. If motif analysis is being conducted, it is recommended to use these summits as the points of interest, as the summits tend to be better indicative of TF binding as opposed to peak centers. This is a four column, tab-delimited BED file containing the following fields:
Chromosome Name
Summit Start (zero-based)
Summit Stop (zero-based)
Summit Name. This is the same name as the corresponding Peak name used in the peak file (column 4 of the Name_peaks.gappedPeak file).
Peak Score. Same score as in the .gappedPeak file.
The DiffBind package specializes in identifying sites that are differentially bound between two sample groups. It also contains functionality for processing peak sets, counting sequencing reads overlapping interals in peak sets, and identifying statistically significant differentially bound sites based on evidence of binding affinity.
Here, we create a sampleSheet that describes the treatments and provide it as input.
library(DiffBind) dba <- dba(sampleSheet='sample_sheet.csv') count <- dba.count(dba, summits=500) contrast <- dba.contrast(count, categories=DBA_CONDITION) analyze <- dba.analyze(contrast) bed <- dba.report(analyze)
In this section, we analysis the enrichment of transcrption factors in our dataset. First, the JASPAR2018 database is accessed to find pwm information for humans. Then a hypergeometric test if performed.
library(MotifDb) library(TFBSTools) library(JASPAR2018) library("BSgenome.Hsapiens.UCSC.hg19") library("TxDb.Hsapiens.UCSC.hg19.knownGene") library(GenomicFeatures) ## Step 1: Match JASPAR TFs to chromosomes opts <- list() opts['species'] <- '9606' pfm <- getMatrixSet(JASPAR2018, opts) pwm <- toPWM(pfm) #pwm <- pwm[1:10] bs <- BSgenome.Hsapiens.UCSC.hg19 chrs <- lapply(1:22, function(i) { bs[[paste0('chr', i)]] }) pro <- promoters(TxDb.Hsapiens.UCSC.hg19.knownGene) ## Step 2: Perform ORA to find signficant TFBS's run_seq3 <- function(seqname, motif_id, pwms, bsgenome, peaks, promoters) { message("Motif ID: ", motif_id, " -- Seqname: ", seqname) ## Find binding sites on chromosome siteseq <- searchSeq(pwms[motif_id], bsgenome[[seqname]], seqname = seqname, min.score="80%", strand = "*") tfbs <- as(siteseq, "GRanges") promoters_chr <- promoters[seqnames(promoters) == seqname] peaks_chr <- peaks[seqnames(peaks) == seqname] ## Get promoters with TF idx <- promoters_chr %over% tfbs promoters_with_tf <- promoters_chr[idx] promoters_without_tf <- promoters_chr[!idx] ## Summary statistics for phyper npromoters_with_tf <- sum(countOverlaps(promoters_chr, tfbs) > 0) npromoters_without_tf <- length(promoters_chr) - npromoters_with_tf npromoters_with_tf_in_open_chromatin <- sum(countOverlaps(peaks_chr, promoters_with_tf) > 0) npromoters_without_tf_in_open_chromatin <- sum(countOverlaps(peaks_chr, promoters_without_tf) > 0) ## Return values c(q = npromoters_with_tf_in_open_chromatin, m = npromoters_with_tf, n = npromoters_without_tf, k = npromoters_with_tf_in_open_chromatin + npromoters_without_tf_in_open_chromatin) } perform_run_by_motif <- function(motif_id, pwm, bs, peaks, pro) { message('Running Motif: ', motif_id) seqnames <- paste0('chr', 1:22) motif_result <- lapply(seqnames, run_seq3, motif_id = motif_id, pwms = pwm, bsgenome = bs, peaks = peaks, promoters = pro) red <- Reduce(`+`, motif_result) perform_run_by_motif <- function(motif_id, pwm, bs, peaks, pro) { message('Running Motif: ', motif_id) seqnames <- paste0('chr', 1:22) motif_result <- lapply(seqnames, run_seq3, motif_id = motif_id, pwms = pwm, bsgenome = bs, peaks = peaks, promoters = pro) red <- Reduce(`+`, motif_result) do.call(phyper, as.list(red)) } run_motif_analysis <- function(peak_file_name, output_file_name) { motif_names <- names(pwm) motif_symbols <- vapply(pwm, function(motif) {motif@name}, character(1)) total_results <- lapply(motif_names, perform_run_by_motif, pwm = pwm, bs = bs, peaks = peaks, pro = pro) total_results <- matrix(total_results) rownames(total_results) <- motif_symbols write.table(total_results, file = output_file_name, col.names = TRUE, quote=FALSE) } peak_LPS <- system.file("extdata", "sigs_LPS.bed", package="HMMRATAC") peak_BG <- system.file("extdata", "sigs_BG.bed", package="HMMRATAC") run_motif_analysis(peaks_LPS, "sig_motifs_LPS.txt") run_motif_analysis(peals_BG, "sig_motifs_BG.txt")
The pre-computed results from the the HMMRATACData
package are available.
The significant transcription factors can be found as follows.
head(bg_motifs) head(lps_motifs)
rGREAT
is a package that can read peak files and infer which pathways are
enriched using functional enrichment analysis
library(rGREAT) job <- submitGreatJob(bed) tb <- getEnrichmentTables(job) look <- tb[[1]] res <- look[look$Hyper_Adjp_BH < 0.05,]
The pre-computed results from the the HMMRATACData
package are available.
The significant pathways can be found as follows.
head(bg_pathways) head(lps_pathways)
The ChIPseeker package is meant to perform statistical testing of significant overlap amoung ChIP seq data sets. It also incorporates open access database GEO to allow users to compare their dataset to those availiable on GEO.
peak <- readPeakFile('65283_chr22_peaks.gappedPeak')
Display peak locations across entire genome
covplot(peak, weightCol="weights")
Perform peak annotation
peakAnno <- annotatePeak(files[[4]], tssRegion=c(-3000, 3000), TxDb=txdb, annoDb="org.Hs.eg.db") plotAnnoPie(peakAnno) upsetplot(peakAnno, vennpie=TRUE) plotDistToTSS(peakAnno, title="Distribution of transcription factor-binding loci relative to TSS")
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.