Chromatin Immuno-Precipitation followed by Sequencing (ChIP-Seq) is used to determine the binding sites of any protein of interest, such as transcription factors or histones with or without a specific modification, at a genome scale [@barski2007; @park2009]. ChIP-Seq entails treating cells with a cross-linking reagent such as formaldehyde; isolating the chromatin and fragmenting it by sonication; immuno-precipitating with antibodies directed against the protein of interest; reversing crosslink; DNA purification and amplification before submission to sequencing. These many steps can introduce biases that make ChIP-Seq more qualitative than quantitative. Different efficiencies in nuclear extraction, DNA sonication, DNA amplification or antibody recognition make it challenging to distinguish between true differential binding events and technical variability.
This problem was addressed by using an external spike-in control to keep track of technical biases between conditions [@orlando2014; @bonhoure2014]. Exogenous DNA from a different non-closely related species is inserted during the protocol to infer scaling factors. This modification was shown to be especially important for revealing global histone modification differences, that are not caught by traditional downstream data normalization techniques, such as Histone H3 lysine-27 trimethyl (H3K27me3) upon Ezh2 inhibitor treatment [@trojer2016].
ChIPSeqSpike provides tools for ChIP-Seq spike-in normalization, assessment and analysis. Conversely to a majority of ChIP-Seq related tools, ChIPSeqSpike does not rely on peak detection. However, if one wants to focus on peaks, ChIPSeqSpike is flexible enough to do so. ChIPSeqSpike provides ready to use scaled bigwig files as output and scaling factors values. We believe that ChIPSeqSpike will be of great value to the genomics community.
On Windows operating system, reading of BigWig files fail due to the Bioconductor package rtracklayer >= 1.37.6 that does not support this file format. Therefore, the boost mode, the input subtraction method, and scaling method do not work on this operating system.
A case study reported Chromatin Immuno-Precipitation followed by Sequencing (ChIP-Seq) experiments that did not show differences upon inhibitor treatment with traditional normalization procedures [@orlando2014]. These experiments looked at the effect of DOT1L inhibitor EPZ5676 [@daigle2013] treatment on the Histone H3 lysine-79 dimethyl (H3K79me2) modification in Jurkat cells. DOT1L is involved in the RNA Polymerase II pause release and licensing of transcriptional elongation. H3K79me2 ChIP-Seq were performed on cells treated with 0%, 50% and 100% EPZ5676 inhibitor (see next section for details on data).
The following code performs spike-in normalization with a wrapper function on
data sub-samples. A 'test_chipseq' temporary results folder is also created.
\newline
library("ChIPSeqSpike") ## Preparing testing data info_file_csv <- system.file("extdata/info.csv", package="ChIPSeqSpike") bam_path <- system.file("extdata/bam_files", package="ChIPSeqSpike") bigwig_path <- system.file("extdata/bigwig_files", package="ChIPSeqSpike") gff_vec <- system.file("extdata/test_coord.gff", package="ChIPSeqSpike") genome_name <- "hg19" output_folder <- "test_chipseqspike" bigwig_files <- system.file("extdata/bigwig_files", c("H3K79me2_0-filtered.bw", "H3K79me2_100-filtered.bw", "H3K79me2_50-filtered.bw", "input_0-filtered.bw", "input_100-filtered.bw", "input_50-filtered.bw"), package="ChIPSeqSpike") ## Copying example files dir.create("./test_chipseqspike") mock <- file.copy(bigwig_files, "test_chipseqspike") ## Performing spike-in normalization if (.Platform$OS.type != "windows") { csds_test <- spikePipe(info_file_csv, bam_path, bigwig_path, gff_vec, genome_name, outputFolder = output_folder) }
The data used in this documentation represent a gold-standard example of the importance of using spike-in controls with ChIP-Seq experiments. It uses chromatin from Drosophila Melanogaster as exogenous spike-in control to correct experimental biases. Without a spike-in control and using only RPM normalization, proper differences in the H3K79me2 histone modification in human Jurkat cells upon EPZ5676 inhibitor treatment were not observed [@orlando2014].
This dataset is made of bigwig and bam files of H3K79me2 ChIP-Seq data and corresponding input DNA controls (see input subtraction section). Bam files contain data aligned to the Human reference genome Hg19 or to the Drosophila reference genome dm3. The latest is used to compute external spike-in scaling factors. All above mentioned data are available at 0%, 50% and 100% EPZ5676 inhibitor treatment.
For the sake of memory and computation time efficiency, bigwig files used in
this vignette are limited to chromosome 1. Reads falling in the top 10% mostly
bound genes (at 0% treatment) with length between 700-800 bp were kept in bam
files. For efficient plotting functions testing, only binding values of the top
100 mostly bound genes are used and can be accessed with
data(result_extractBinding)
. Scores for factors and read counts were computed
on the whole dataset and are available through
data(result_estimateScalingFactors)
(see below).
The whole dataset is accessible at GSE60104. Specifically, the data used are H3K79me2 0% (GSM1465004), H3K79me2 50% (GSM1465006), H3K79me2 100% (GSM1465008), input DNA 0% (GSM1511465), input DNA 50% (GSM1511467) and input DNA 100% (GSM1511469).
The data were treated as follows: Quality of sequencing was assessed with FastQC v0.11.4. Reads having less than 80% of quality scores above 25 were removed with NGSQCToolkit v2.3.3 [@ngsqctoolkit]. Homo Sapiens Hg19 and Drosophila Melanogaster dm3 from Illumina igenomes UCSC Collection were used. ChIP-Seq data were aligned with Bowtie2 v2.1.0 [@bowtie2] with default parameters. Sam outputs were converted to Bam with Samtools v1.0.6 [@samtools] and sorted with Picard tools v1.88. Data were further processed with Pasha v0.99.21 [@pasha]. Fixed steps wiggle files were converted to bigwigs with wigToBigWig.
Results with the complete dataset are also provided in this documentation.
The spike-in normalization procedure consists of 4 steps: RPM scaling, input DNA subtraction, RPM scaling reversal and exogenous spike-in DNA scaling. Below is detailed the different steps included in the above mentioned wrapper function 'spikePipe'.
The different data necessary for proper spike-in scaling are provided in a csv or a tab separated txt file. The columns must contain proper names and are organized as follows: Experiment name (expName); bam file name of data aligned to the endogenous reference genome (endogenousBam); bam file name of data aligned to the exogenous reference genome (exogenousBam); the corresponding input DNA bam file aligned to the endogenous reference genome (inputBam); the fixed steps bigwig file name of data aligned to the endogenous reference genome (bigWigEndogenous) and the fixed steps bigwig file names of the corresponding input DNA experiment aligned to the endogenous reference genome (bigWigInput). \newline ```r info_file <- read.csv(info_file_csv) head(info_file)
From the info file, two kinds of objects can be generated: either a ChIPSeqSpikeDataset or a ChIPSeqSpikeDatasetList depending upon the number of input DNA experiments. A ChIPSeqSpikeDatasetList object is a list of ChIPSeqSpikeDataset object that is created if several input DNA experiments are used. In this latter case, ChIP-Seq experiments are grouped by their corresponding input DNA. The function spikeDataset creates automatically the suitable object. The folder path to the bam and fixed steps bigwig files must be provided. \newline ```r csds_test <- spikeDataset(info_file_csv, bam_path, bigwig_path) is(csds_test)
Reading and processing bigwig and bam files can be memory-greedy. By default, files are read, processed and written at each step of the normalization procedure to enable data treatment on a regular desktop computer. One could wish to reduce the time of computation especially when processing a lot of data. ChIPSeqSpike reduces such time by providing a boost mode. \newline ```r if (.Platform$OS.type != "windows") { csds_testBoost <- spikeDataset(info_file_csv, bam_path, bigwig_path, boost = TRUE) is(csds_testBoost) }
Binding scores for each experiment are stored in a GRanges object and are directly accessible by functions. \newline ```r if (.Platform$OS.type != "windows") { getLoadedData(csds_testBoost[[1]]) }
Even if optimizing greatly the time of computing, one should know that loading binding scores of all experiments is greedy in memory and should be used with caution. The boost mode is ignored in the rest of the vignette, but all code provided in the following sections, with the exception of section 3.1 ( plotTransform), can be run with csds_testBoost.
A ChIPSeqSpikeDataset object, at this point, is made of slots storing paths to files. In order to compute scaling factors, bam counts are first computed. A scaling factor is defined as 1000000/bam_count. The method estimateScalingFactors returns bam counts and endogenous/exogenous scaling factors for all experiments. In the following example, scores are computed using chromosome 1 only. Scores on the whole dataset are also indicated below. \newline ```r csds_test <- estimateScalingFactors(csds_test, verbose = FALSE)
The different scores can be visualized: \newline ```r ## Scores on testing sub-samples spikeSummary(csds_test) ##Scores on whole dataset data(result_estimateScalingFactors) spikeSummary(csds)
An important parameter to keep in mind when performing spike-in with ChIP-seq is the percentage of exogenous DNA relative to that of endogenous DNA. The amount of exogenous DNA should be between 2-25% of endogenous DNA. The method getRatio returns the percentage of exogenous DNA and throws a warning if this percentage is not within the 2-25% range. In theory, having more than 25% exogenous DNA should not affect the normalization, whereas having less than 2% is usually not sufficient to perform a reliable normalization. \newline ```r getRatio(csds_test)
data(ratio) ratio
### RPM scaling The first normalization applied to the data is the 'Reads Per Million' (RPM) mapped reads. The method 'scaling' is used to achieve such normalization using default parameters. It is also used to reverse the RPM normalization and apply exogenous scaling factors (see sections [2.3.5](#RPMreversal) and [2.3.6](#exoscaling)). \newline \newline ```r if (.Platform$OS.type != "windows") { csds_test <- scaling(csds_test, outputFolder = output_folder) }
In the context of this vignette, output_folder is precised since the testing files were copied to a temporary 'test_chipseqspike' folder. The slots containing paths to files will be updated to this folder. If not precised, the RPM scaled bigwig files are written to the same folder containing bigwigs. This statement is also applicable for the next operations below.
When Immuno-Precipitating (IP) DNA bound by a given protein, a control is needed to distinguish background noise from true signal. This is typically achieved by performing a mock IP, omitting the use of antibody. After mock IP sequencing, one can notice peaks of signal above background. These peaks have to be removed from the experiment since they represent false positives. The inputSubtraction method simply subtracts scores of the input DNA experiment from the corresponding ones. If in boost mode, the input subtracted values are stored in the dataset object and no files are written. For this latter case, the method exportBigWigs can be used to output the transformed files. \newline ```r if (.Platform$OS.type != "windows") { csds_test <- inputSubtraction(csds_test) }
### RPM scaling reversal {#RPMreversal} After RPM and input subtraction normalization, the RPM normalization is reversed in order for the data to be normalized by the exogenous scaling factors. \newline ```r if (.Platform$OS.type != "windows") { csds_test <- scaling(csds_test, reverse = TRUE) }
\newpage
Finally, exogenous scaling factors are applied to the data. \newline ```r if (.Platform$OS.type != "windows") { csds_test <- scaling(csds_test, type = "exo") }
### Extract binding values The last step of data processing is to extract and format binding scores in order to use plotting methods. The 'extractBinding' method extracts binding scores at different locations and stores these values in the form of PlotSetArray objects and matrices (see ?extractBinding for more details). The scores are retrieved on annotations provided in a gff file. If one wishes to focus on peaks, their coordinates should be submitted at this step. The genome name must also be provided. For details about installing the required BSgenome package corresponding to the endogenous organism, see the [BSgenome](https://bioconductor.org/packages/release/bioc/html/BSgenome.html) package documentation. \newline ```r if (.Platform$OS.type != "windows") { csds_test <- extractBinding(csds_test, gff_vec, genome_name) }
ChIPSeqSpike offers several graphical methods for normalization diagnosis and data exploration. These choices enable one to visualize each step of the normalization through exploring inter-samples differences using profiles, heatmaps, boxplots and correlation plots.
In the following sections, the testing data are restricted to the 100 mostly bound genes. Results on the complete set of hg19 genes are also indicated.
The first step of spike-in normalized ChIP-Seq data analysis is an inter-sample comparison by meta-gene or meta-annotation profiling. The method 'plotProfile' automatically plots all experiments at the start, midpoint, end and composite locations of the annotations provided to the method extractBinding in gff format. Here is the result of profiling H3K79me2 on the 100 mostly bound genes at 0% inhibitor treatment (figure \ref{figure1}). \newline ```r", fig.height = 6} data(result_extractBinding) plotProfile(csds, legend = TRUE)
The unspiked data (however RPM scaled and input subtracted) can be added to the plot (figure \ref{figure2}). \newline ```r", fig.height = 7} plotProfile(csds, legend = TRUE, notScaled = TRUE)
\newpage The effect of the individual processing steps for each experiment can also be plotted. \newline ```r plotTransform(csds, legend = TRUE, separateWindows = TRUE)
## Heatmaps plotHeatmaps is a versatile method based on the plotHeatmap method of the seqplots package [@seqplots]. This method enables one to represent data at different locations (start, end, midpoint, composite) and at different stages of the normalization process. Different scaling (log, zscore, etc) and different clustering approaches (k-means, hierarchical, etc) can be used (see documentation for more details). Figure \ref{figure3} shows a k-means clustering of spiked data, each group being sub-sorted by decreasing values. \newline \newline ```r", fig.height = 5} plotHeatmaps(csds, nb_of_groups = 2, clustering_method = "kmeans")
\newpage Figure \ref{figure4} illustrates a clustering by decreasing values on the whole dataset.
boxplotSpike plots boxplots of the mean values of ChIP-seq experiments on the annotations given to the extractBinding method. It offers a wide range of graphical representations that includes violin plots (see documentation for details). Figure \ref{figure5} illustrates all transformations of all dataset indicating confidence intervals. \newpage ```r", fig.height = 6} par(cex.axis=0.5) boxplotSpike(csds, rawFile = TRUE, rpmFile = TRUE, bgsubFile = TRUE, revFile = TRUE, spiked = TRUE, outline = FALSE)
\newpage Figure \ref{figure6} shows spiked experiments indicating each mean value, mean and standard deviation with a violin plot representation. \newline ```r", fig.height = 6} boxplotSpike(csds, outline = FALSE, violin=TRUE, mean_with_sd = TRUE, jitter = TRUE)
The plotCor method plots the correlation between ChIP-seq experiments using heatscatter plot or, if heatscatterplot = FALSE, correlation tables. For heatscatter plots, ChIPSeqSpike makes use of the heatscatter function of the package LSD and the corrplot function of the package corrplot is used to generate correlation tables. This offers a wide range of graphical possibilities for assessing the correlation between experiments and transformation steps (see documentation for more details).
Figure \ref{figure7} shows two correlation table representations between spiked experiments. \newline ```r", fig.height = 6, fig.width=10} par(mfrow=c(1,2)) plotCor(csds, heatscatterplot = FALSE) plotCor(csds, heatscatterplot = FALSE, method_corrplot = "number")
Figure \ref{figure8} illustrates a heatscatter plot of spiked data after log transformation (only positive mean binding values are kept) and figure \ref{figure9} is the result of running the same code on the whole refseq Hg19 gene set. \newline ```r", fig.height=6} plotCor(csds, method_scale = "log")
```r unlink("test_chipseqspike/", recursive = TRUE)
# Session info ```r sessionInfo(package="ChIPSeqSpike")
references:
family: Zhao given: Keji container-title: Cell volume: 129 URL: 'http://dx.doi.org/10.1016/10.1016/j.cell.2007.05.009' DOI: 10.1016/j.cell.2007.05.009 issue: 4 publisher: Cell press page: 823-837 type: article-journal issued: year: 2007 month: 5
id: park2009 title: 'ChIP–seq: advantages and challenges of a maturing technology' author:
family: Park given: PJ container-title: Nature Reviews Genetics volume: 10 URL: 'http://dx.doi.org/10.1038/nrg2641' DOI: 10.1038/nrg2641 issue: publisher: Nature Publishing Group page: 669-680 type: article-journal issued: year: 2009 month: 10
id: bonhoure2014 title: 'Quantifying ChIP-seq data: a spiking method providing an internal reference for sample-to-sample normalization' author:
family: Consortium given: CycliX container-title: Genome Research volume: 24 URL: 'http://dx.doi.org/10.1101/gr.168260.113' DOI: 10.1101/gr.168260.113 issue: publisher: Cold Spring Harbor Press page: 1157-1168 type: article-journal issued: year: 2014 month: 04
id: orlando2014 title: 'Quantitative ChIP-seq Normalization reveals global modulation of the Epigenome' author:
family: Guenther given: Matthew G container-title: Cell Reports volume: 9 URL: 'http://dx.doi.org/10.1016/j.celrep.2014.10.018' DOI: 10.1016/j.celrep.2014.10.018 issue: 3 publisher: Cell press page: 1163–1170 type: article-journal issued: year: 2014 month: 11
id: trojer2016 title: 'An Alternative Approach to ChIP-Seq Normalization Enables Detection of Genome-Wide Changes in Histone H3 Lysine 27 Trimethylation upon EZH2 Inhibition' author:
family: Trojer given: Patrick container-title: PLoS ONE volume: 11 URL: 'http://dx.doi.org/10.1371/journal.pone.0166438' DOI: 10.1371/journal.pone.0166438 issue: 11 publisher: PLoS type: article-journal issued: year: 2016 month: 11
id: daigle2013 title: 'Potent inhibition of DOT1L as treatment of MLL-fusion leukemia' author:
family: Pollock given: RM container-title: Blood volume: 122 URL: 'http://dx.doi.org/10.1182/blood-2013-04-497644' DOI: 10.1182/blood-2013-04-497644 issue: 6 publisher: American Society of Hematology page: 1017-1025 type: article-journal issued: year: 2013 month: 8
id: ngsqctoolkit title: 'NGS QC toolkit: A toolkit for quality control of next generation sequencing data' author:
family: Jain given: Mukesh container-title: PLoS ONE volume: 7 URL: 'http://dx.doi.org/10.1371/journal.pone.0030619' DOI: 10.1371/journal.pone.0030619 issue: 2 publisher: PLoS type: article-journal issued: year: 2012 month: 2
id: bowtie2 title: 'Fast gapped-read alignment with Bowtie 2' author:
family: Salzberg given: Steven L container-title: Nature Method volume: 9 URL: 'http://dx.doi.org/10.1038/nmeth.1923' DOI: 10.1038/nmeth.1923. issue: 4 publisher: Nature Publishing Group type: article-journal issued: year: 2012 month: 3
id: samtools title: 'The Sequence Alignment/Map format and SAMtools' author:
family: Durbin given: Richard container-title: Bioinformatics volume: 25 URL: 'http://dx.doi.org/10.1093/bioinformatics/btp352' DOI: 10.1093/bioinformatics/btp352 issue: 16 publisher: Oxford Academic type: article-journal issued: year: 2009 month: 8
id: pasha title: 'Pasha a versatile R package for piling chromatin HTS data' author:
family: Andrau given: Jean-Christophe container-title: Bioinformatics volume: 25 URL: 'http://dx.doi.org/10.1093/bioinformatics/btp352' DOI: 10.1093/bioinformatics/btp352 issue: 16 publisher: Oxford Academic type: article-journal issued: year: 2009 month: 8
id: seqplots title: 'SeqPlots - Interactive software for exploratory data analyses, pattern discovery and visualization in genomics' author:
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.