knitr::opts_chunk$set( collapse = TRUE, comment = "#>", cache = FALSE, fig.width = 4, fig.height = 5, fig.show = "hold", global.par = FALSE )
The epistack
package main objective is the visualizations of stacks
of genomic tracks (such as, but not restricted to, ChIP-seq or
DNA methyation data)
centered at genomic regions of interest. epistack
needs three
different inputs:
GRanges
(easily obtained from bigwig
or bam
files)GRanges
(easily obtained from gtf
or bed
files)Each inputs are then combined in a single RangedSummarizedExperiment
object use by epistack's
ploting functions.
After introducing epistack's plotting capacity, this document will present two use cases:
Epistack is a visualisation package. It uses a RangedSummarizedExperiment
object as input, with
matrices embeded as assays. We will discuss how to
build such input obects in the next section. For now on, we will focus on
the visualisation functions using the example dataset included in the package.
The dataset can be accessed with:
library(GenomicRanges) library(SummarizedExperiment) library(epistack) data("stackepi") stackepi
plotEpisatck()
functionThis dataset can be visualised with the plotEpistack()
function.
The first parameter is the input RangedSummarizedExperiment
object.
The second (optionnal) parameter, assays
specifies which assay(s)
should be displayed as heatmap(s). In the stackepi
dataset, only one track is present, with an assay names DNAme
.
Note that it is possible to have several different tracks embeded in the
same RangedSummarizedExperiment
object, as demonstarted in the next sections.
An aditional metric_col
is used, to display score associated with each
anchor region, such as expression values or peak scores. Optionaly,
the metric_col
can be transformed before ploting using the
metric_transfunc
parameters.
plotEpistack( stackepi, assay = "DNAme", metric_col = "exp", ylim = c(0, 1), zlim = c(0, 1), x_labels = c("-2.5kb", "TSS", "+2.5kb"), titles = "DNA methylation", legends = "%mCpG", metric_title = "Expression", metric_label = "log10(TPM+1)", metric_transfunc = function(x) log10(x+1) )
If a bin
column is present, it is used to generate one average profile per bin.
stackepi <- addBins(stackepi, nbins = 5) plotEpistack( stackepi, assay = "DNAme", metric_col = "exp", ylim = c(0, 1), zlim = c(0, 1), x_labels = c("-2.5kb", "TSS", "+2.5kb"), titles = "DNA methylation", legends = "%mCpG", metric_title = "Expression", metric_label = "log10(TPM+1)", metric_transfunc = function(x) log10(x+1) )
Colours can be changed using dedicated parameters:
plotEpistack( stackepi, assay = "DNAme", metric_col = "exp", ylim = c(0, 1), zlim = c(0, 1), x_labels = c("-2.5kb", "TSS", "+2.5kb"), titles = "DNA methylation", legends = "%mCpG", metric_title = "Expression", metric_label = "log10(TPM+1)", metric_transfunc = function(x) log10(x+1), tints = "dodgerblue", bin_palette = rainbow )
Text size, and other graphical parameters, can be changed using cex
inside of
plotEpistack()
. Indeed, additional arguments will be passed internaly to
par()
(see ?par
for more details).
plotEpistack( stackepi, assay = "DNAme", metric_col = "exp", ylim = c(0, 1), zlim = c(0, 1), x_labels = c("-2.5kb", "TSS", "+2.5kb"), titles = "DNA methylation", legends = "%mCpG", metric_title = "Expression", metric_label = "log10(TPM+1)", metric_transfunc = function(x) log10(x+1), cex = 0.4, cex.main = 0.6 )
Each panel can be plotted individually using dedicated functions. For example:
plotAverageProfile( stackepi, ylim = c(0, 1), assay = "DNAme", x_labels = c("-2.5kb", "TSS", "+2.5kb"), )
And:
plotStackProfile( stackepi, assay = "DNAme", x_labels = c("-2.5kb", "TSS", "+2.5kb"), palette = hcl.colors, zlim = c(0, 1) )
It is therefore possible to arrange panels as you whish, using the multipanel framework of your choice (layout, grid, patchwork, etc.).
layout(matrix(1:3, ncol = 1), heights = c(1.5, 3, 0.5)) old_par <- par(mar = c(2.5, 4, 0.6, 0.6)) plotAverageProfile( stackepi, assay = "DNAme", x_labels = c("-2.5kb", "TSS", "+2.5kb"), ylim = c(0, 1), ) plotStackProfile( stackepi, assay = "DNAme", x_labels = c("-2.5kb", "TSS", "+2.5kb"), zlim = c(0, 1), palette = hcl.colors ) plotStackProfileLegend( zlim = c(0, 1), palette = hcl.colors ) par(old_par) layout(1)
In this part, we will use example ChIP-seq data from the 2016 CSAMA course "Basics of ChIP-seq data analysis" by Aleksandra Pekowska and Simon Anders. Data can be found in this github repository: github.com/Bioconductor/CSAMA2016/tree/master/lab-5-chipseq/EpigeneticsCSAMA/inst/bedfiles
There is no need to download the data as it can be remotely parsed.
The data consists of two H3K27ac ChIP-seq replicates, an input control,
and one list of peak for each replicates. It has been generated in
mouse Embryonic Stem cells and been subseted to have only data from chromosome 6
to allow fast vignette generation (but epistack
can deal with whole genome
ChIP-seq data!).
path_reads <- c( rep1 = "https://raw.githubusercontent.com/Bioconductor/CSAMA2016/master/lab-5-chipseq/EpigeneticsCSAMA/inst/bedfiles/H3K27ac_rep1_filtered_ucsc_chr6.bed", rep2 = "https://raw.githubusercontent.com/Bioconductor/CSAMA2016/master/lab-5-chipseq/EpigeneticsCSAMA/inst/bedfiles/H3K27ac_rep2_filtered_ucsc_chr6.bed", input = "https://raw.githubusercontent.com/Bioconductor/CSAMA2016/master/lab-5-chipseq/EpigeneticsCSAMA/inst/bedfiles/ES_input_filtered_ucsc_chr6.bed" ) path_peaks <- c( peak1 = "https://raw.githubusercontent.com/Bioconductor/CSAMA2016/master/lab-5-chipseq/EpigeneticsCSAMA/inst/bedfiles/Rep1_peaks_ucsc_chr6.bed", peak2 = "https://raw.githubusercontent.com/Bioconductor/CSAMA2016/master/lab-5-chipseq/EpigeneticsCSAMA/inst/bedfiles/Rep2_peaks_ucsc_chr6.bed" )
We first read the peaks using r Biocpkg("rtracklayer")
:
library(rtracklayer) peaks <- lapply(path_peaks, import)
Peaks from each replicates can be merged using r Biocpkg("GenomicRanges")
union()
function (loaded with r Biocpkg("rtracklayer")
).
We then rescue peaks metadata, and compute a new mean_score
that we use
to arrange our peak list.
merged_peaks <- GenomicRanges::union(peaks[[1]], peaks[[2]]) scores_rep1 <- double(length(merged_peaks)) scores_rep1[findOverlaps(peaks[[1]], merged_peaks, select = "first")] <- peaks[[1]]$score scores_rep2 <- double(length(merged_peaks)) scores_rep2[findOverlaps(peaks[[2]], merged_peaks, select = "first")] <- peaks[[2]]$score peak_type <- ifelse( scores_rep1 != 0 & scores_rep2 != 0, "Both", ifelse( scores_rep1 != 0, "Rep1 only", "Rep2 only" ) ) mcols(merged_peaks) <- DataFrame(scores_rep1, scores_rep2, peak_type) merged_peaks$mean_scores <- apply((mcols(merged_peaks)[, c("scores_rep1", "scores_rep2")]), 1, mean) merged_peaks <- merged_peaks[order(merged_peaks$mean_scores, decreasing = TRUE), ] rm(scores_rep1, scores_rep2, peak_type) merged_peaks
We then import the ChIP-seq reads. In the example datasets,
they are provided as .bed files, but for ChIP-seq we recommand
importing .bigwig coverage files. Bam files can also be imported using
GenomicAlignments::readGAlignment*()
.
reads <- lapply(path_reads, import)
We generate coverage matrices with ChIP-seq coverage signal
summarized around each ChIP-seq peaks.
Several tools exists to generate such coverage matrices. We will demonstrate the
normalizeToMatrix()
method from r Biocpkg("EnrichedHeatmap")
.
Other alternatives include r Biocpkg("Repitools")
' featureScores()
,
or r Biocpkg("seqplots")
' getPlotSetArray()
.
We will focus on the regions around peak centers, extended from +/-5kb with
a window size of 250 bp. We keep track of the extent of our region of interest
in a xlab
variable, and specify unique column names for each matrix.
Note: when using ChIP-seq data from a bigwig file, use the value_column
parameter
of the normalizeToMatrix()
function.
library(EnrichedHeatmap) coverage_matrices <- lapply( reads, function(x) { normalizeToMatrix( x, resize(merged_peaks, width = 1, fix = "center"), extend = 5000, w = 250, mean_mode = "coverage" ) } ) xlabs <- c("-5kb", "peak center", "+5kb")
We then add the peak coordinates and each matrix to a
RangedSummarizedExperiment
object.
merged_peaks <- SummarizedExperiment( rowRanges = merged_peaks, assays = coverage_matrices )
knitr::opts_chunk$set( fig.width=6, fig.height=5 )
The plotEpistack()
function will use the merged_peaks
object to generate a
complex representation of the ChIP-seq signals around
the genomic feature of interests (here ChIP-seq peaks).
plotEpistack( merged_peaks, assays = c("rep1", "rep2", "input"), tints = c("dodgerblue", "firebrick1", "grey"), titles = c("Rep1", "Rep2" , "Input"), x_labels = xlabs, zlim = c(0, 4), ylim = c(0, 4), metric_col = "mean_scores", metric_title = "Peak score", metric_label = "score" )
If a bin
column is present in the input GRanges
object, it will
be used to annotate the figure and to generate one average profile per bin
in the lower panels. Here we use the peak_type
as our bin column.
rowRanges(merged_peaks)$bin <- rowRanges(merged_peaks)$peak_type plotEpistack( merged_peaks, assays = c("rep1", "rep2", "input"), tints = c("dodgerblue", "firebrick1", "grey"), titles = c("Rep1", "Rep2" , "Input"), x_labels = xlabs, zlim = c(0, 4), ylim = c(0, 4), metric_col = "mean_scores", metric_title = "Peak score", metric_label = "score", bin_palette = colorRampPalette(c("darkorchid1", "dodgerblue", "firebrick1")), npix_height = 300 )
We can also sort on the bins first, and then on peak score. Epistack will
respect the order in the GRanges
object.
merged_peaks <- merged_peaks[order( rowRanges(merged_peaks)$bin, rowRanges(merged_peaks)$mean_scores, decreasing = c(FALSE, TRUE), method = "radix" ), ] plotEpistack( merged_peaks, patterns = c("rep1", "rep2", "input"), tints = c("dodgerblue", "firebrick1", "grey"), titles = c("Rep1", "Rep2" , "Input"), x_labels = xlabs, zlim = c(0, 4), ylim = c(0, 4), metric_col = "mean_scores", metric_title = "Peak score", metric_label = "score", bin_palette = colorRampPalette(c("darkorchid1", "dodgerblue", "firebrick1")), npix_height = 300 )
In this part, we will plot the epigenetic signals at gene promoters, or more
precisely around gene Transcription Start Sites (TSS).
TSS coordinates can be obtained from various sources. One can access
the Ensembl annotations using
r Biocpkg("biomaRt")
, download a .gtf
file and parse it using r Biocpkg("rtracklayer")
's import()
, or use r Biocpkg("AnnotationHub")
and
r Biocpkg("ensembldb")
. It is however important to work with the same genome
version has the one used to align the ChIP-seq reads.
For simplicity, we will use r Biocpkg("EnrichedHeatmap")
example data.
load( system.file("extdata", "chr21_test_data.RData", package = "EnrichedHeatmap"), verbose = TRUE ) tss <- promoters(genes, upstream = 0, downstream = 1) tss$gene_id <- names(tss) tss
Epistack can work with any units and any scores, not limited to
expression data. Here we will use gene expression data from an RNA-seq
experiment, in RPKM units, as it is the data format available in
r Biocpkg("EnrichedHeatmap")
example dataset.
To join the expression data to the TSS coordinates, we will use an
r Biocpkg("epistack")
utility function addMetricAndArrangeGRanges()
:
expr <- data.frame( gene_id = names(rpkm), expr = rpkm ) epidata <- addMetricAndArrangeGRanges( tss, expr, gr_key = "gene_id", order_key = "gene_id", order_value = "expr" ) epidata
We create 5 bins of genes of equal sizes depending on the expression levels,
with r Biocpkg("epistack")
utility function addBins()
:
epidata <- addBins(epidata, nbins = 5) epidata
As previously described, we use r Biocpkg("EnrichedHeatmap")
's
normalizeToMatrix()
function to extract the signals, rename the signal
colmun names, and add them to the epidata GRanges object:
methstack <- normalizeToMatrix( meth, epidata, value_column = "meth", extend = 5000, w = 250, mean_mode = "absolute" ) h3k4me3stack <- normalizeToMatrix( H3K4me3, epidata, value_column = "coverage", extend = 5000, w = 250, mean_mode = "coverage" ) epidata <- SummarizedExperiment( rowRanges = epidata, assays = list(DNAme = methstack, H3K4me3 = h3k4me3stack) )
The epidata
GRanges object is now ready to plot:
plotEpistack( epidata, tints = c("dodgerblue", "orange"), zlim = list(c(0, 1), c(0, 25)), ylim = list(c(0, 1), c(0, 50)), x_labels = c("-5kb", "TSS", "+5kb"), legends = c("%mCpG", "Coverage"), metric_col = "expr", metric_title = "Gene expression", metric_label = "log10(RPKM+1)", metric_transfunc = function(x) log10(x + 1), npix_height = 300 )
To cite {epistack}, please refers to its HAL entry:
Safia Saci, Guillaume Devailly. epistack: An R package to visualise stack profiles of epigenomic signals. 2021, ⟨hal-03401251v2⟩
sessionInfo()
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.