refineSites: Adjust ChIP-seq Read Count Table


Description

For a given set of sites with the same or comparable width, the read count table from multiple samples is adjusted for potential GC effects. For each sample separately, GC effects are estimated from the effective GC content and read counts of the sites using generalized linear mixture models. The count table is then adjusted based on the estimated GC effects. It is important that the given sites include both foreground and background regions; see sites below.

Usage

refineSites(counts, sites, flank = 250L, outputidx = rep(TRUE,
  nrow(counts)), gcrange = c(0.3, 0.8), emtrace = TRUE, plot = TRUE,
  model = c("nbinom", "poisson"), mu0 = 1, mu1 = 50, theta0 = mu0,
  theta1 = mu1, p = 0.2, converge = 1e-04, genome = "hg19",
  gctype = c("ladder", "tricube"))

Arguments

counts

A count matrix in which each row corresponds to one element of sites and each column corresponds to one sample. Each value is the read count for one site in one sample. Because this function uses effective GC content, it is important, when counting sequencing reads, to extend either the original reads or the original sites so that reads whose 5' ends start in the flanking regions are counted.

sites

A GRanges object whose length equals the number of rows in the counts matrix. Preferably every range has the same width; otherwise the mixture model is modeling different things, since wider ranges will tend to have more reads. However, it is acceptable if only a minority of ranges have a different width, since the model is fairly robust to outliers. It is also important that sites include both foreground and background regions in each sample; otherwise the mixture model will fail to fit two components. Fortunately, if you are inputting a large collection of samples, foreground sites in one sample may serve as background in other samples. In this case, manually selecting real background sites is not necessary.

flank

A non-negative integer specifying the flanking width of ChIP-seq binding. This parameter allows for reads that appear in the flanking regions, with probability decreasing as the distance from the binding region increases. It is used in the calculation of effective GC content.

outputidx

A logical vector whose length equals the number of rows in counts. It specifies which subset of the adjusted count matrix should be returned. This is especially useful if you have manually collected background sites and want to export only the sites you care about.
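
As a minimal sketch of how such an index might be built, assume the background sites were appended after the sites of interest. The toy counts and the names n_target and n_background are hypothetical and only serve the illustration.

```r
# Hypothetical layout: 5 sites of interest followed by 3 manually
# collected background sites (toy counts for 2 samples).
n_target <- 5L
n_background <- 3L
counts <- matrix(seq_len((n_target + n_background) * 2L),
                 ncol = 2L,
                 dimnames = list(NULL, c("sample1", "sample2")))

# TRUE for rows that should be returned, FALSE for rows that are
# only meant to support model fitting.
outputidx <- rep(c(TRUE, FALSE), times = c(n_target, n_background))

# The index must have one entry per row of the count matrix.
stopifnot(length(outputidx) == nrow(counts))
```

Such a vector would then be supplied as outputidx = outputidx, so that only the first five rows appear in the returned matrix.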

gcrange

A non-negative numeric vector of length 2. This vector sets the range of GC content used to filter regions for GC effect estimation. For the human genome, most regions have GC content between 0.3 and 0.8, which is the default. Regions with GC content outside this range are ignored. This range is critical when very few foreground regions are selected for mixture model fitting, since outliers could drive the regression lines. Thus, if possible, first make a scatter plot of counts against GC content to decide on this parameter. Alternatively, select a narrower range, e.g. c(0.35,0.7), to avoid outlier effects from both high and low GC-content regions.
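
The inspection suggested above could be sketched roughly as follows. The GC values and counts are simulated, the variable names are hypothetical, and treating the range boundaries as inclusive is an assumption; the function itself works on effective GC content computed internally.

```r
set.seed(1)
# Simulated stand-ins: GC content and read counts for 200 sites.
gc  <- runif(200, min = 0.2, max = 0.9)
cnt <- rpois(200, lambda = 20)

# Candidate range, narrower than the default c(0.3, 0.8).
gcrange <- c(0.35, 0.7)

# Sites that would remain for GC effect estimation
# (inclusive boundaries are an assumption).
keep <- gc >= gcrange[1] & gc <= gcrange[2]

# Scatter plot of counts against GC content with the candidate range.
if (interactive()) {
  plot(gc, cnt, xlab = "GC content", ylab = "Read count")
  abline(v = gcrange, lty = 2)
}

# Number of sites inside and outside the range.
table(keep)
```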

emtrace

A logical value which, when TRUE (default), prints the trace of log-likelihood changes over the EM iterations.

plot

A logical value which, when TRUE (default), produces the mixture fitting plot.

model

A character string specifying the distribution to be used in generalized linear model fitting. The default is negative binomial (nbinom); poisson is also supported. For more details, see gcEffects.

mu0

A non-negative number giving the initial read count signal for background sites. It is used as the starting value of the background mean in poisson/nbinom fitting.

mu1

A non-negative number giving the initial read count signal for foreground sites. It is used as the starting value of the foreground mean in poisson/nbinom fitting.

theta0

A non-negative number giving the initial shape parameter of the negative binomial model for background sites. For more detail, see theta in the glm.nb function.

theta1

A non-negative number giving the initial shape parameter of the negative binomial model for foreground sites. For more detail, see theta in the glm.nb function.
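
To give a feel for what the shape parameter controls, the sketch below uses base R's rnbinom, whose size argument plays the role of theta in the mean parameterization, so that the variance is mu + mu^2/theta. The values mirror the default theta1 = mu1 = 50; the simulation is an illustration only and is not part of the function.

```r
set.seed(7)
mu    <- 50   # mirrors the default mu1
theta <- 50   # mirrors the default theta1 = mu1

# Negative binomial draws in the mean/size parameterization.
x <- rnbinom(1e5, mu = mu, size = theta)

# Empirical moments next to the theoretical variance mu + mu^2 / theta.
c(mean = mean(x), var = var(x), expected_var = mu + mu^2 / theta)

# A larger theta moves the model toward Poisson (variance close to mu).
var(rnbinom(1e5, mu = mu, size = 1e6))
```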

p

A non-negative number specifying the proportion of foreground sites among all sites used for estimation. It is used as a starting value for the EM algorithm.
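
How starting values such as mu0, mu1 and p enter an EM fit can be illustrated with a deliberately simplified two-component Poisson mixture. This sketch has no GC covariate and no negative binomial component, uses simulated counts, and runs a fixed number of iterations; it is not the package's implementation.

```r
set.seed(42)
# Simulated counts: 80% background (mean 2), 20% foreground (mean 40).
y <- c(rpois(800, lambda = 2), rpois(200, lambda = 40))

# Starting values mirroring the defaults of mu0, mu1 and p.
mu0 <- 1
mu1 <- 50
p   <- 0.2

for (iter in seq_len(50)) {
  # E-step: posterior probability that each site is foreground.
  d1 <- p * dpois(y, mu1)
  d0 <- (1 - p) * dpois(y, mu0)
  z  <- d1 / (d1 + d0)

  # M-step: update the mixing proportion and the component means.
  p   <- mean(z)
  mu1 <- sum(z * y) / sum(z)
  mu0 <- sum((1 - z) * y) / sum(1 - z)
}

round(c(mu0 = mu0, mu1 = mu1, p = p), 2)
```

With well separated components the estimates move from the starting values toward the simulated means and proportion within a few iterations.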

converge

A non-negative number specifying the termination condition of the EM algorithm. The EM algorithm stops when the ratio of the log-likelihood increment to the whole log likelihood is less than or equal to converge.
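
One way to read this stopping rule is sketched below. The helper has_converged is hypothetical, and both the use of the absolute value and the choice of the new log likelihood as the denominator are assumptions about how the ratio is formed.

```r
# Hypothetical helper: relative change of the log likelihood between
# two consecutive EM iterations, compared against the threshold.
has_converged <- function(loglik_old, loglik_new, converge = 1e-04) {
  abs((loglik_new - loglik_old) / loglik_new) <= converge
}

has_converged(-1000, -999.95)  # TRUE: relative change is about 5e-05
has_converged(-1000, -990)     # FALSE: relative change is about 1e-02
```

A smaller value of converge demands more iterations before the fit is accepted.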

genome

A BSgenome object containing the sequences of the reference genome that was used to align the reads, or the name of this reference genome specified in a way that is accepted by the getBSgenome function defined in the BSgenome software package. In that case the corresponding BSgenome data package needs to be already installed (see ?getBSgenome in the BSgenome package for the details).

gctype

A character string specifying the method used to calculate effective GC content. The default, ladder, is based on a uniform fragment distribution. A smoother method based on the tricube assumption is also available. However, tricube should not be used if flank is too large.

Value

The count matrix after GC adjustment. The matrix values are no longer integers.


tengmx/gcapc documentation built on May 31, 2019, 8:35 a.m.