optimalBinsize: Assess optimal genomic bin size to partition read counts.

View source: R/otimalBinsize.R

optimalBinsize    R Documentation

Assess optimal genomic bin size to partition read counts.

Description

Calculate Akaike's information criterion (AIC) and the cross-validation (CV) log-likelihood to infer the optimal bin size for partitioning read counts across the genome.

Usage

optimalBinsize(bamfiles = NULL, bamnames = NULL, pathToBams = NULL,
  binSizes = c(10, 30, 50, 100, 250, 500, 750, 1000), measure = "CV",
  lineColor = "red4", chromosomesFilter = c("X", "Y", "M", "MT"),
  savePlot = FALSE, plotPrefix = "optimalBinsize", minMapq = 20,
  isPaired = NA, isProperPair = NA, isUnmappedQuery = FALSE,
  hasUnmappedMate = NA, isMinusStrand = NA, isMateMinusStrand = NA,
  isFirstMateRead = NA, isSecondMateRead = NA, isSecondaryAlignment = NA,
  isDuplicate = FALSE)

Arguments

bamfiles

A character vector of BAM file names, with or without full path. If NULL (default), all files with the .bam extension are read from the directory given by pathToBams.

bamnames

An optional character vector of sample names. Defaults to file names with extension .bam removed.

pathToBams

If bamfiles is NULL, all files ending with ".bam" extension will be read from this path.

binSizes

A numeric vector of genomic bin sizes, in units of kilo base pairs (1000 base pairs), e.g. binSizes = c(10, 30, 50) corresponds to bins of 10, 30 and 50 kbp.

measure

The goodness-of-fit criterion ("AIC" or "CV"). Defaults to "CV".

lineColor

Line color to use in plot.

chromosomesFilter

A character vector specifying which chromosomes to filter out. Defaults to the sex chromosomes and the mitochondrial chromosome, i.e. c("X", "Y", "M", "MT"). Set to NA to use all chromosomes.

savePlot

If TRUE, saves plots of each sample to the working directory. Defaults to FALSE, as shown in Usage.

plotPrefix

Prefix for plot title and pdf file name. Defaults to "optimalBinsize".

minMapq

If quality scores exist, the minimum quality score required to keep a read. Defaults to 20.

isPaired

A logical(1) indicating whether unpaired (FALSE), paired (TRUE), or any (NA, default) read should be returned.

isProperPair

A logical(1) indicating whether improperly paired (FALSE), properly paired (TRUE), or any (NA, default) read should be returned.

isUnmappedQuery

A logical(1) indicating whether unmapped (TRUE), mapped (FALSE, default), or any (NA) read should be returned.

hasUnmappedMate

A logical(1) indicating whether reads with mapped (FALSE), unmapped (TRUE), or any (NA, default) mate should be returned.

isMinusStrand

A logical(1) indicating whether reads aligned to the plus (FALSE), minus (TRUE), or any (NA, default) strand should be returned.

isMateMinusStrand

A logical(1) indicating whether mate reads aligned to the plus (FALSE), minus (TRUE), or any (NA, default) strand should be returned.

isFirstMateRead

A logical(1) indicating whether the first mate read should be returned (TRUE) or not (FALSE), or whether mate read number should be ignored (NA, default).

isSecondMateRead

A logical(1) indicating whether the second mate read should be returned (TRUE) or not (FALSE), or whether mate read number should be ignored (NA, default).

isSecondaryAlignment

A logical(1) indicating whether alignments that are primary (FALSE), are not primary (TRUE) or whose primary status does not matter (NA, default) should be returned.

isDuplicate

A logical(1) indicating whether un-duplicated (FALSE, default), duplicated (TRUE), or any (NA) reads should be returned.

Details

As guidance, choose bin sizes that have low AIC and/or high CV values and that also contain 30-180 reads per bin on average. This strikes a reasonable balance between error variability and bias in the copy number alteration (CNA) estimates. A much smaller bin size may leave many genomic regions with zero read counts and make the overall analysis uninformative. At the other extreme, a much larger bin size will 'smooth out' some patterns of alteration (i.e. increase bias). The procedure for estimating the optimal bin size assumes low-coverage sequence data, so choose sensible values for the binSizes argument when the input data is not of shallow whole-genome depth (<10 million reads).
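The 30-180 reads-per-bin guidance can be checked roughly before running the function. The following is a minimal base R sketch, not part of CNAclinic: it assumes a hypothetical total of 5 million retained reads, an approximate genome length of 3.1e9 bp, and reads spread uniformly across the genome. The variable names are illustrative only, and the real averages reported by optimalBinsize will differ because of filtering and uneven coverage.

```r
# Hypothetical inputs; replace with values appropriate for your data.
totalReads   <- 5e6     # assumed number of reads kept after filtering
genomeLength <- 3.1e9   # assumed approximate genome length in bp
binSizes     <- c(10, 30, 50, 100, 250, 500, 750, 1000)  # kbp, as in Usage

# Expected reads per bin if reads were spread uniformly over the genome.
nBins      <- genomeLength / (binSizes * 1000)  # kbp -> bp
meanPerBin <- totalReads / nBins

# Flag bin sizes whose expected average falls in the suggested 30-180 range.
guide <- data.frame(
  binSize_kbp = binSizes,
  meanReads   = round(meanPerBin, 1),
  inRange     = meanPerBin >= 30 & meanPerBin <= 180
)
print(guide)
```

Under these assumptions only the 30, 50 and 100 kbp bins fall inside the suggested range, so those would be natural candidates to compare by AIC or CV.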

Value

Returns a list. The first element is a data.frame holding the average read counts per bin size; the remaining elements are sample-specific ggplot objects.
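A call and the handling of its return value might look like the sketch below. It uses only arguments listed under Usage. The directory "path/to/bams" is a placeholder, and the names res, avgCounts and samplePlots are illustrative. The call is guarded so that the snippet does nothing harmful when the package or the directory is unavailable.

```r
bamDir <- "path/to/bams"  # hypothetical directory containing .bam files

if (requireNamespace("CNAclinic", quietly = TRUE) && dir.exists(bamDir)) {
  res <- CNAclinic::optimalBinsize(
    pathToBams = bamDir,
    binSizes   = c(30, 50, 100),  # kbp
    measure    = "CV",
    savePlot   = FALSE
  )

  avgCounts   <- res[[1]]  # data.frame of average read counts per bin size
  samplePlots <- res[-1]   # remaining elements: one ggplot object per sample

  print(avgCounts)
  if (length(samplePlots) > 0) print(samplePlots[[1]])
} else {
  message("CNAclinic or the BAM directory is not available; skipping.")
}
```

Keeping the data.frame and the plots in separate objects makes it easier to compare the average counts against the 30-180 guidance before inspecting each sample's curve.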

Author(s)

Dineika Chandrananda

See Also

Internally, the function opt.win.onesample of the NGSoptwin package is used.

Examples

     ## Not run: 
      vignette("CNAclinic")
     
## End(Not run)


sdchandra/CNAclinic documentation built on Aug. 8, 2024, 4:08 p.m.