library(BiocStyle)
The bam2bw
function can be used to compute per-nucleotide or per-bin coverage
from alignments and save it to a bigwig file. In this process, information about
individual reads is lost, but the produced signals are considerably more
lightweight and amenable to visualization. The bigwig format is readily queried
from R or compatible with a variety of tools, including genome browsers.
To introduce the different variations on coverage, let's assume you've go the following single-end reads:
suppressPackageStartupMessages(library(epiwraps)) # we create some arbitrary genomic ranges gr <- GRanges("chr1", IRanges(c(30,70,120), width=50), strand=c("+","+","-"), seqlengths=c(chr1=500)) plotSignalTracks(list(reads=gr), region="chr1:1:180", extend=0, genomeAxis=FALSE)
For testing purposes, we'll save this as a bam file and index it:
bam <- tempfile(fileext = ".bam") # temp file name rtracklayer::export(gr, bam, format="bam") Rsamtools::indexBam(bam)
Using these example reads, we can illustrate different ways of computing coverages.
First, we can save coverage at different resolutions, from full resolution (each nucleotide is a single bin) to larger bin sizes, the latter giving smaller filesizes:
# Full coverage with bin width of 1 nucleotide (i.e. full resolution) cov_full_bw1 <- tempfile(fileext = ".bw") # temp file name bam2bw(bam, cov_full_bw1, binWidth=1L, scaling=FALSE) # Full coverage with larger bins cov_full_bw25 <- tempfile(fileext = ".bw") bam2bw(bam, cov_full_bw25, binWidth=25L, scaling=FALSE) plotSignalTracks(list(reads=gr, "binWidth=1"=cov_full_bw1, "binWidth=25"=cov_full_bw25), region="chr1:1:180", extend=0)
Both tracks compile the number of reads that overlap each position, but in the bottom track the signal is by chunks of 25 nucleotides. By default, the maximum signal inside a bin is used, however it is possible to change this:
# Using mean per bin: cov_full_bw25mean <- tempfile(fileext = ".bw") bam2bw(bam, cov_full_bw25mean, binWidth=25L, binSummarization = "mean", scaling=FALSE) plotSignalTracks(list(reads=gr, "binWidth=1"=cov_full_bw1, "binWidth=25\n(max)"=cov_full_bw25, "binWidth=25\n(mean)"=cov_full_bw25mean), region="chr1:1:180", extend=0)
In most cases, single-end reads are just the beginning of the DNA fragments
obtained, and we know the average size of the fragments from library prep QC
(if not, see the estimateFragSize
function, or the simpler estimate.mean.fraglen
function of the r BiocStyle::Biocpkg("chipseq")
package). It is therefore
common to extend reads to this size when computing coverage, so as to obtain
the number of fragments (rather than reads) coverage each position. This can be
done as follows:
# Here the reads are 50bp, and we want to extend them to 100bp, hence _by_ 50: cov_full_ext <- tempfile(fileext = ".bw") bam2bw(bam, cov_full_ext, binWidth=1L, extend=50L, scaling=FALSE) plotSignalTracks(list(reads=gr, "no extension"=cov_full_bw1, "read extension"=cov_full_ext), region="chr1:1:190", extend=0)
Taking into account read extension shows better the peak at the center, which is indicative of the fact that this is the region that was the most enriched in the captured fragments.
Instead of computing coverage, we could compute the number of reads starting, ending, or being centered at each position:
# Here the reads are 50bp, and we want to extend them to 100bp, hence _by_ 50: cov_start <- tempfile(fileext = ".bw") bam2bw(bam, cov_start, binWidth=1L, extend=50L, scaling=FALSE, type="start") cov_center <- tempfile(fileext = ".bw") bam2bw(bam, cov_center, binWidth=1L, extend=50L, scaling=FALSE, type="center") plotSignalTracks(list(reads=gr, "type=full"=cov_full_bw1, "type=start"=cov_start, "type=center"=cov_center), region="chr1:1:190", extend=0)
Note that when extending reads, as in this case, the position (e.g. "center") are relative to the extended read (i.e. extension is applied first).
The following figure, created using epiwraps
(see
the vignette on generating such plots),
represent chromatin accessibility (ATAC-seq) signals around bound CTCF motifs
in T-cells. The different signals are based on different bigwig files derived
from the same bam file using the functions described above.
shift=c(4L,-5L), minFragLength=147, maxFragLength=230, type="center", extend=10L
.
Using this we can see nucleosomes well-positioned at some distance from CTCF
binding sites, but a relative depletion of nucleosomes at the bound site itself.shift=c(4L,-5L), maxFragLength=120
. These are indicative of TF
binding, and indeed we only see and enrichment in the center.shift=c(4L,-5L), binWidth=1L, maxFragLength=120, type="ends"
. Using this we
can see a nice footprint protected from the transposase by CTCF binding.ATAC fragment sizes convey rich information, but they are not perfect. For example, regions of DNA bound by multiple TFs can sometimes be captured as longer fragments which we would mistakenly classify as mono-nucleosome containing.
If you use fragment files (preferably tabix-indexed) rather than bam files as
input, you can still perform most of the above tasks. See the ?frag2bw
function for more information.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.