fragmentLenDetect: Fragments length detection from single-end sequencing samples
In nucleR: Nucleosome positioning package for R


When using single-end sequencing, the resulting partial sequences map to only one strand, which biases the coverage profile if left uncorrected. The only way to correct this is to know the average size of the real fragments. nucleR uses this information when preprocessing single-end reads. You can provide this value yourself (a length of 147bp is usually a good approximation) or you can use this method to estimate the size of the inserts automatically.

fragmentLenDetect(
  reads,
  samples = 1000,
  window = 1000,
  min.shift = 1,
  max.shift = 100,
  mc.cores = 1,
  as.shift = FALSE
)

## S4 method for signature 'AlignedRead'
fragmentLenDetect(
  reads,
  samples = 1000,
  window = 1000,
  min.shift = 1,
  max.shift = 100,
  mc.cores = 1,
  as.shift = FALSE
)

## S4 method for signature 'GRanges'
fragmentLenDetect(
  reads,
  samples = 1000,
  window = 1000,
  min.shift = 1,
  max.shift = 100,
  mc.cores = 1,
  as.shift = FALSE
)

`reads`	Raw single-end reads (in ShortRead::AlignedRead or GenomicRanges::GRanges format)
`samples`	Number of samples to perform the analysis (more = slower but more accurate)
`window`	Analysis window. Usually there's no need to touch this parameter.
`min.shift, max.shift`	Minimum and maximum shift to apply on the strands to detect the optimal fragment size. If the range is too big, the performance decreases.
`mc.cores`	If multicore support is available, maximum number of cores the function is allowed to use.
`as.shift`	If TRUE, returns the shift needed to align the middle of the reads on opposite strands. If FALSE, returns the mean inferred fragment length.

This function shifts one strand downstream one base at a time, from min.shift to max.shift. At every step, the correlation between both strands is computed over a randomly positioned region of length window. The shift giving the maximum correlation is recorded, and the result is averaged over samples repetitions.

The final returned length is the best shift detected plus the width of the reads. You can improve the performance of this function by reducing the samples value and/or narrowing the shift range. The window size has almost no impact on performance, although too small a value can give biased results.
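
The shift-and-correlate idea can be illustrated with a toy sketch in base R. This is not nucleR's internal code: the simulated data, the helper names (coverage_vec, best_shift) and the chosen sizes are assumptions made only for illustration, and the shift range is deliberately wider than the function's default so that the true shift falls inside it.

```r
# Toy illustration in base R (NOT nucleR's implementation; all names and
# sizes below are hypothetical).
set.seed(1)

genome_len <- 20000  # assumed chromosome length
read_len   <- 40     # width of each single-end read
frag_len   <- 147    # true fragment length we try to recover

# Per-base coverage for reads of width read_len starting at `starts`
coverage_vec <- function(starts, len) {
    cov <- numeric(len)
    for (s in starts) {
        idx <- s:(s + read_len - 1)
        cov[idx] <- cov[idx] + 1
    }
    cov
}

# Each simulated fragment yields one "+" read at its left end and one "-"
# read at its right end
frag_starts <- sample(1:(genome_len - frag_len), 2000, replace = TRUE)
cov_pos <- coverage_vec(frag_starts, genome_len)
cov_neg <- coverage_vec(frag_starts + frag_len - read_len, genome_len)

# For one randomly placed window, try every shift and keep the one that
# maximises the correlation between the two strand coverages
best_shift <- function(window = 1000, shifts = 1:150) {
    from <- sample(1:(genome_len - window - max(shifts)), 1)
    p <- cov_pos[from:(from + window - 1)]
    cors <- sapply(shifts, function(s) {
        n <- cov_neg[(from + s):(from + s + window - 1)]
        cor(p, n)
    })
    shifts[which.max(cors)]
}

# Average the best shift over several random windows, then add the read
# width to obtain the fragment length estimate
shift_est <- round(mean(replicate(20, best_shift())))
frag_est  <- shift_est + read_len
print(c(shift = shift_est, fragment = frag_est))
```

In this simulation both strands are generated from the same fragments, so the correlation peaks at a shift of frag_len - read_len, and adding the read width recovers the fragment length. Real data are noisier, which is why averaging over many sampled windows matters.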

Inferred mean length of the inserts by default, or the shift needed to align the strands if as.shift=TRUE.

Oscar Flores oflores@mmb.pcb.ub.es

library(nucleR)
library(GenomicRanges)
library(IRanges)

# Create a synthetic dataset, simulating single-end reads, for positive and
# negative strands
# Positive strand reads
pos <- syntheticNucMap(nuc.len=40, lin.len=130)$syn.reads
# Negative strand (shifted 147bp)
neg <- IRanges(end=start(pos)+147, width=40)
sim <- GRanges(
    seqnames="chr1",
    ranges=c(pos, neg),
    strand=c(rep("+", length(pos)), rep("-", length(neg)))
)

# Detect fragment length (we know by construction it is really 147)
fragmentLenDetect(sim, samples=50)
# samples=50 restricts the sampling to speed up the example