library(BiocStyle) # Decide whether to display parts for BioC (TRUE) or F1000 (FALSE) on.bioc <- TRUE library(knitr) # Use fig.width = 7 for html and fig.width = 6 for pdf fig.width <- ifelse(on.bioc, 8, 6) knitr::opts_chunk$set(cache = 2, warning = FALSE, message = FALSE, error = FALSE, cache.path = "cache/", fig.path = "figure/", fig.width = fig.width)
This vignette introduces the package Bioconductor package r Biocpkg("CSSQ")
. This package is designed to perform differential binding analysis for ChIP-seq samples. Differential binding is when there is a significant difference in the signal intensities at a region between two groups. The idea behind CSSQ is to implement a statistically robust pipeline which is built for ChIP-seq datasets to identify regions of differential binding among a set of interesting regions. It contains functions to quantify the signal over a predefined set of regions from the user, transform the data, normalize across samples and perform statistical tests to determine differential binding. There is an ever-increasing number of ChIP-seq samples, and this opens the door to compare these binding events between two groups of samples. CSSQ does this while considering the signal to noise ratios, paired control experiments and the lack of multiple (>2) replicates in a typical ChIP-seq experiment. It controls the false discovery rates while not compromising in its sensitivity. This document will give a brief overview of the processing steps.
The workflow for CSSQ
is as follows
To use CSSQ
, you will require a bed file containing the regions of interest as well as sample information file which contains the necessary information about the samples being analyzed. Below are examples of both files.
library(CSSQ) regionBed <- read.table(system.file("extdata", "chr19_regions.bed", package="CSSQ",mustWork = TRUE)) sampleInfo <- read.table(system.file("extdata", "sample_info.txt", package="CSSQ",mustWork = TRUE),sep="\t",header=TRUE) head(regionBed) sampleInfo
CSSQ
provides two options for obtaining raw count data used to perform the analysis. The first option is to ask CSSQ
to count the reads and perform background correction. The following commands will perform the steps and return an object with count data and meta data.
countData <- getRegionCounts(system.file("extdata", "chr19_regions.bed", package="CSSQ"),sampleInfo,sampleDir = system.file("extdata", package="CSSQ")) countData head(assays(countData)$countData) colData(countData) rowRanges(countData)
The second option is to provide CSSQ
with the count data in a tab separated format. The count data given will directly be used in transformation and normalization steps and no background correction will be performed.
countData <- loadCountData(system.file("extdata", "sample_count_data.txt", package="CSSQ",mustWork = TRUE),system.file("extdata", "chr19_regions.bed", package="CSSQ"),sampleInfo) countData head(assays(countData)$countData) colData(countData) rowRanges(countData)
Anscombe transformation is then applied to the count data using the ansTransform
function. This function also has an option parameter to plot the data distribution of the data before and after transformation.
ansCountData <- ansTransform(countData) ansCountData head(assays(ansCountData)$ansCount)
Anscombe transformed data is then normalized across samples. Normalization in CSSQ
using the normalizeData
function. It takes as input the number of clusters to use in the underlying k-means algorithm that is used in the normalization process. It is advised to choose the number of clusters by analyzing the data distribution.
normInfo <- normalizeData(ansCountData,numClusters=4) normInfo head(assays(normInfo)$normCount)
The normalized data is then used by the DBAnalyze
function to perform statistical tests to identify differentially bound regions. CSSQ
utilized non-parametric approaches to calculate the test statistics and to calculate the p-value from the test statistics. This is done due to prevalent trend observed in ChIP-seq datasets of not having more than two replicates. The resulting GRanges
object from this function contains the Benjamini Hochberg adjusted P-value and a Fold change for each of the input regions.
result <- DBAnalyze(normInfo,comparison=c("HSMM","HESC")) result
For convenience, most of the above steps are wrapped in preprocessData
function to perform them all at one go. All the above steps can be performed using the below commands.
#To let CSSQ calculate the counts processedData <- preprocessData(system.file("extdata", "chr19_regions.bed", package="CSSQ"),system.file("extdata", "sample_info.txt", package="CSSQ"),sampleDir = system.file("extdata", package="CSSQ"),numClusters=4,noNeg=TRUE,plotData=FALSE) #To provide CSSQ with the counts processedData <- preprocessData(system.file("extdata", "chr19_regions.bed", package="CSSQ"),system.file("extdata", "sample_info.txt", package="CSSQ"),inputCountData = system.file("extdata", "sample_count_data.txt", package="CSSQ",mustWork = TRUE),numClusters=4,noNeg=TRUE,plotData=FALSE) #To find differential binding sites result <- DBAnalyze(processedData,comparison=c("HSMM","HESC")) processedData result
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.