knitr::opts_chunk$set(echo = TRUE)

Overview

peaksat is a barebones R package to do Peak Saturation analysis.

Peak Saturation is a part of the quality control process and is useful for determining the potential benifit of deeper sequencing for improving peak calls. When libraries are too shallow, only a semi-random subset of peaks are called. This package provides methods for randomly subsetting bam files at fixed depth intervals and calling peaks via macs2 on each. As a result we can produce a curve of peak count vs read count and (with enough reads) estimate the total number of peaks that are potentially callable in each sequecning library. This also allows us to determine that either the library has reached saturation or roughly estimate how many reads would be required to reach saturation.

It requires an existing isntallation of macs2, which can be on your PATH or specified explicity when calling submit_peaksat_jobs() or via the PS_MACS2_PATH option.

It also currently requires SGE (Sun Grid Engine) to run.

Basically, just run it on Stein Galaxy server for now and everything should be fine.

Usage

Setup

A vector of treatment (this is macs2's term for the pulldown samples, not whatever treatment an experiment may be testing) bam files matched with control (macs2's term for input or non-pulldown) bam files. If the same control bam is applicable to all treatment bams, only a single control bam is sufficient.

library(peaksat)
library(magrittr)
#set optional options
#peaksat uses the macs2 on your PATH by default but can be overriden by setting the PS_MACS2_PATH option
PS_OPTIONS$PS_MACS2_PATH = "/usr/local/bin/macs2"
#peaksat will output to you current directory or use this option is set
PS_OPTIONS$PS_OUTDIR = "~/peaksat_overview"

Next we need to make a configuration object.

#Importantly, these options will not impact any configuration objects already created.
ps = peaksat_config(stat = PS_STATS$qValue, stat_value = .01, is_PE = FALSE)
#Explicitly setting out_dir and macs2_path will override settings in the options.
ps.2 = peaksat_config(stat = PS_STATS$qValue, stat_value = .01, is_PE = FALSE, 
                      out_dir = "~/peak_saturation2", 
                      macs2_path = "/slipstream/home/joeboyd/anaconda2/bin/macs2")

And the last thing we need is bam files.

#This is just basic file matching.
treat_files = "/slipstream/galaxy/uploads/working/qc_framework/output" %>%
  dir(pattern = "MCF.+_H3K4ME3", full.names = TRUE) %>%
  dir(pattern = ".bam$", full.names = TRUE)

#Always use the deepest input available per sample.  Input depth has a big impact on peak calling.
#Although we have individual input replicates we could use, we just want the pooled versions.
input_files = "/slipstream/galaxy/uploads/working/qc_framework/output" %>%
  dir(pattern = "MCF.+_input_pool", full.names = TRUE) %>%
  dir(pattern = ".bam$", full.names = TRUE)
#We have 3 treatment bams per cell line.
input_files = rep(input_files, each = 3)

Run Peak Saturation

Finally we can submit the subset/peak call jobs.

#await_completion = TRUE will hold up R until all submitted jobs have completed. This is useful so you can wait to run the remainder of a script that depends on peaksat completing.
#if you don't want that, set await_completion = FALSE
jids = submit_peaksat_jobs(ps, treat_bams = treat_files, ctrl_bams = input_files, await_completion = TRUE)

Once all jobs have finished we can make plots, but first we need to load the read and peak count data like so:

cnt_dt = load_counts(ps)
head(cnt_dt)

Plots

This data is tidy formatted to be usable with ggplot.

There are a couple included plotting functions.

plot_peak_saturation_lines(cnt_dt)
plot_peak_saturation_lines.facetted(cnt_dt)

For more control of these included plot functions, simply filter the input data.table

As an example, we'll just look at the pooled samples. In my opinion the 2 pooled samples here have begun to saturate.

plot_peak_saturation_lines(cnt_dt[grepl("pool", sample)])

Or just the MCF10A reps.

plot_peak_saturation_lines(cnt_dt[grepl("MCF10A.+_R[12]", sample)])

To do more sophisticated plots, you can simply work with cnt_dt and use ggplot.

library(data.table)
library(ggplot2)
cnt_dt[, c("cell", "mark", "rep") := tstrsplit(sample, "_", keep = 1:3)]
cnt_dt[, is_max := read_count == max(read_count), .(sample)]
lab_dt = cnt_dt[is_max == TRUE]
lab_dt
ggplot(cnt_dt, aes(x = read_count, y = peak_count, group = sample, color = cell)) +
  geom_path() +
  geom_point(data = lab_dt) +
  # geom_text(data = lab_dt, aes(label = paste(cell, mark, rep), x = read_count + 800e3), hjust = 0, show.legend = FALSE) +
  scale_x_continuous(expand = expansion(c(.1, .5)), labels = function(x)x/1e6) +
  scale_y_continuous(labels = function(x)x/1e3) +
  labs(x = "reads (M)", y = "peaks (k)") +
  cowplot::theme_cowplot() +
  scale_color_brewer(palette = "Set1") +
  facet_wrap(~rep)
ggplot(cnt_dt, aes(x = read_count, y = peak_count, group = sample, color = rep)) +
  geom_path() +
  geom_point(data = lab_dt) +
  # geom_text(data = lab_dt, aes(label = paste(cell, mark, rep), x = read_count + 800e3), hjust = 0, show.legend = FALSE) +
  scale_x_continuous(expand = expansion(c(.1, .5)), labels = function(x)x/1e6) +
  scale_y_continuous(labels = function(x)x/1e3) +
  labs(x = "reads (M)", y = "peaks (k)") +
  cowplot::theme_cowplot() +
  scale_color_brewer(palette = "Dark2") +
  facet_wrap(~cell)

Depth Estimate

The final piece of this analysis is to estimate the read depth required for saturation.

peaksat has two methods for this purpose. One uses linear regression, the other logarithmic regression.

Linear

Run like this :

depth_estimate.lin = estimate_depth.linear(cnt_dt, min_read_count = 15e6)

Results in a table.

DT::datatable(depth_estimate.lin$estimates[, .(sample, peak_count_k = round(saturation_peak_count / 1e3, 2), read_count_M = round(saturation_read_count / 1e6, 2))])

... and plots.

depth_estimate.lin$plots

Logarithmic

Run like this :

depth_estimate.log = estimate_depth.log(cnt_dt, min_read_count = 5e6, target_peaks = 18e3)

Results in a table.

DT::datatable(depth_estimate.log$estimates[, .(sample, peak_count_k = round(saturation_peak_count / 1e3, 2), read_count_M = round(saturation_read_count / 1e6, 2))])

... and plots.

cowplot::plot_grid(plotlist = depth_estimate.log$plots)


FrietzeLabUVM/peaksat documentation built on Oct. 2, 2024, 7:07 p.m.