coverage-methods: Coverage of a set of ranges
In Bioconductor/IRanges: Foundation of integer range manipulation in Bioconductor

coverage-methods

R Documentation

Coverage of a set of ranges

Description

For each position in the space underlying a set of ranges, counts the number of ranges that cover it.

Usage

coverage(x, shift=0L, width=NULL, weight=1L, ...)

## S4 method for signature 'IntegerRanges'
coverage(x, shift=0L, width=NULL, weight=1L,
            method=c("auto", "sort", "hash", "naive"))

## S4 method for signature 'IntegerRangesList'
coverage(x, shift=0L, width=NULL, weight=1L,
            method=c("auto", "sort", "hash", "naive"))

Arguments

`x`	A IntegerRanges, Views, or IntegerRangesList object. See ?`coverage-methods` in the GenomicRanges package for `coverage` methods for other objects.
`shift`, `weight`	`shift` specifies how much each range in `x` should be shifted before the coverage is computed. A positive shift value will shift the corresponding range in `x` to the right, and a negative value to the left. NAs are not allowed. `weight` assigns a weight to each range in `x`. If `x` is an IntegerRanges or Views object: each of these arguments must be an integer or numeric vector parallel to `x` (will get recycled if necessary). Alternatively, each of these arguments can also be specified as a single string naming a metadata column in `x` (i.e. a column in `mcols(x)`) to be used as the `shift` (or `weight`) vector. Note that when `x` is an IPos object, each of these arguments can only be a single number. If `x` is an IntegerRangesList object: each of these arguments must be a numeric vector or list-like object of the same length as `x` (will get recycled if necessary). If it's a numeric vector, it's first turned into a list with `as.list`. After recycling, each list element `shift[[i]]` (or `weight[[i]]`) must be an integer or numeric vector parallel to `x[[i]]` (will get recycled if necessary). If `weight` is an integer vector or list-like object of integer vectors, the coverage vector(s) will be returned as integer-Rle object(s). If it's a numeric vector or list-like object of numeric vectors, the coverage vector(s) will be returned as numeric-Rle object(s).
`width`	Specifies the length of the returned coverage vector(s). If `x` is an IntegerRanges object: `width` must be `NULL` (the default), an NA, or a single non-negative integer. After being shifted, the ranges in `x` are always clipped on the left to keep only their positive portion i.e. their intersection with the [1, +inf) interval. If `width` is a single non-negative integer, then they're also clipped on the right to keep only their intersection with the [1, width] interval. In that case `coverage` returns a vector of length `width`. Otherwise, it returns a vector that extends to the last position in the underlying space covered by the shifted ranges. If `x` is a Views object: Same as for a IntegerRanges object, except that, if `width` is `NULL` then it's treated as if it was `length(subject(x))`. If `x` is a IntegerRangesList object: `width` must be `NULL` or an integer vector parallel to `x` (i.e. with one element per list element in `x`). If not `NULL`, the vector must contain NAs or non-negative integers and it will get recycled to the length of `x` if necessary. If `NULL`, it is replaced with `NA` and recycled to the length of `x`. Finally `width[i]` is used to compute the coverage vector for `x[[i]]` and is therefore treated like explained above (when `x` is a IntegerRanges object).
`method`	If `method` is set to `"sort"`, then `x` is sorted previous to the calculation of the coverage. If `method` is set to `"hash"` or `"naive"`, then `x` is hashed directly to a vector of length `width` without previous sorting. The `"hash"` method is faster than the `"sort"` method when `x` is large (i.e. contains a lot of ranges). When `x` is small and `width` is big (e.g. `x` represents a small set of reads aligned to a big chromosome), then `method="sort"` is faster and uses less memory than `method="hash"`. The `"naive"` method is a slower version of the `"hash"` method that has the advantage of avoiding floating point artefacts in the no-coverage regions of the numeric-Rle object returned by `coverage()` when the weights are supplied as a numeric vector of type `double`. See "FLOATING POINT ARITHMETIC CAN BRING A SURPRISE" section in the Examples below for more information. Using `method="auto"` selects between the `"sort"` and `"hash"` methods, picking the one that is predicted to be faster based on `length(x)` and `width`.
`...`	Further arguments to be passed to or from other methods.

Value

If x is a IntegerRanges or Views object: An integer- or numeric-Rle object depending on whether weight is an integer or numeric vector.

If x is a IntegerRangesList object: An RleList object with one coverage vector per list element in x, and with x names propagated to it. The i-th coverage vector can be either an integer- or numeric-Rle object, depending on the type of weight[[i]] (after weight has gone thru as.list and recycling, like described previously).

Author(s)

H. Pagès and P. Aboyoun

Examples

## ---------------------------------------------------------------------
## A. COVERAGE OF AN IRanges OBJECT
## ---------------------------------------------------------------------
x <- IRanges(start=c(-2L, 6L, 9L, -4L, 1L, 0L, -6L, 10L),
             width=c( 5L, 0L, 6L,  1L, 4L, 3L,  2L,  3L))
coverage(x)
coverage(x, shift=7)
coverage(x, shift=7, width=27)
coverage(x, shift=c(-4, 2))  # 'shift' gets recycled
coverage(x, shift=c(-4, 2), width=12)
coverage(x, shift=-max(end(x)))

coverage(restrict(x, 1, 10))
coverage(reduce(x), shift=7)
coverage(gaps(shift(x, 7), start=1, end=27))

## With weights:
coverage(x, weight=as.integer(10^(0:7)))  # integer-Rle
coverage(x, weight=c(2.8, -10))  # numeric-Rle, 'shift' gets recycled

## ---------------------------------------------------------------------
## B. FLOATING POINT ARITHMETIC CAN BRING A SURPRISE
## ---------------------------------------------------------------------
## Please be aware that rounding errors in floating point arithmetic can
## lead to some surprising results when computing a weighted coverage:
y <- IRanges(c(4, 10), c(18, 15))
w1 <- 0.958
w2 <- 1e4
cvg <- coverage(y, width=100, weight=c(w1, w2))
cvg  # non-zero coverage at positions 19 to 100!

## This is an artefact of floating point arithmetic and the algorithm
## used to compute the weighted coverage. It can be observed with basic
## floating point arithmetic:
w1 + w2 - w2 - w1  # very small non-zero value!

## Note that this only happens with the "sort" and "hash" methods but
## not with the "naive" method:
coverage(y, width=100, weight=c(w1, w2), method="sort")
coverage(y, width=100, weight=c(w1, w2), method="hash")
coverage(y, width=100, weight=c(w1, w2), method="naive")

## These very small non-zero coverage values in the no-coverage regions
## of the numeric-Rle object returned by coverage() are not always
## present. But when they are, they can cause problems downstream or
## in unit tests. For example downstream code that relies on things
## like 'cvg != 0' to find regions with coverage won't work properly.
## This can be mitigated either by selecting the "naive" method (be aware
## that this can slow down things significantly) or by "cleaning" 'cvg'
## first e.g. with something like 'cvg <- round(cvg, digits)' where
## 'digits' is a carefully chosen number of digits:
cvg <- round(cvg, digits=3)

## Note that this rounding will also have the interesting side effect of
## reducing the memory footprint of the Rle object in general (because
## some runs might get merged into a single run as a consequence of the
## rounding).

## ---------------------------------------------------------------------
## C. COVERAGE OF AN IPos OBJECT
## ---------------------------------------------------------------------
pos_runs <- IRanges(c(1, 5, 9), c(10, 8, 15))
ipos <- IPos(pos_runs)
coverage(ipos)

## ---------------------------------------------------------------------
## D. COVERAGE OF AN IRangesList OBJECT
## ---------------------------------------------------------------------
x <- IRangesList(A=IRanges(3*(4:-1), width=1:3), B=IRanges(2:10, width=5))
cvg <- coverage(x)
cvg

stopifnot(identical(cvg[[1]], coverage(x[[1]])))
stopifnot(identical(cvg[[2]], coverage(x[[2]])))

coverage(x, width=c(50, 9))
coverage(x, width=c(NA, 9))
coverage(x, width=9)  # 'width' gets recycled

## Each list element in 'shift' and 'weight' gets recycled to the length
## of the corresponding element in 'x'.
weight <- list(as.integer(10^(0:5)), -0.77)
cvg2 <- coverage(x, weight=weight)
cvg2  # 1st coverage vector is an integer-Rle, 2nd is a numeric-Rle

identical(mapply(coverage, x=x, weight=weight), as.list(cvg2))

## ---------------------------------------------------------------------
## E. SOME MATHEMATICAL PROPERTIES OF THE coverage() FUNCTION
## ---------------------------------------------------------------------

## PROPERTY 1: The coverage vector is not affected by reordering the
## input ranges:
set.seed(24)
x <- IRanges(sample(1000, 40, replace=TRUE), width=17:10)
cvg0 <- coverage(x)
stopifnot(identical(coverage(sample(x)), cvg0))

## Of course, if the ranges are shifted and/or assigned weights, then
## this doesn't hold anymore, unless the 'shift' and/or 'weight'
## arguments are reordered accordingly.

## PROPERTY 2: The coverage of the concatenation of 2 IntegerRanges
## objects 'x' and 'y' is the sum of the 2 individual coverage vectors:
y <- IRanges(sample(-20:280, 36, replace=TRUE), width=28)
stopifnot(identical(coverage(c(x, y), width=100),
                    coverage(x, width=100) + coverage(y, width=100)))

## Note that, because adding 2 vectors in R recycles the shortest to
## the length of the longest, the following is generally FALSE:
identical(coverage(c(x, y)), coverage(x) + coverage(y))  # FALSE

## It would only be TRUE if the 2 coverage vectors that we add had the
## same length, which would only happen by chance. By using the same
## 'width' value when we computed the 2 coverages previously, we made
## sure they had the same length.

## Because of properties 1 & 2, we have:
x1 <- x[c(TRUE, FALSE)]  # pick up 1st, 3rd, 5th, etc... ranges
x2 <- x[c(FALSE, TRUE)]  # pick up 2nd, 4th, 6th, etc... ranges
cvg1 <- coverage(x1, width=100)
cvg2 <- coverage(x2, width=100)
stopifnot(identical(coverage(x, width=100), cvg1 + cvg2))

## PROPERTY 3: Multiplying the weights by a scalar has the effect of
## multiplying the coverage vector by the same scalar:
weight <- runif(40)
cvg3 <- coverage(x, weight=weight)
stopifnot(all.equal(coverage(x, weight=-2.68 * weight), -2.68 * cvg3))

## Because of properties 1 & 2 & 3, we have:
stopifnot(identical(coverage(x, width=100, weight=c(5L, -11L)),
                    5L * cvg1 - 11L * cvg2))

## PROPERTY 4: Using the sum of 2 weight vectors produces the same
## result as using the 2 weight vectors separately and summing the
## 2 results:
weight2 <- 10 * runif(40) + 3.7
stopifnot(all.equal(coverage(x, weight=weight + weight2),
                    cvg3 + coverage(x, weight=weight2)))

## PROPERTY 5: Repeating any input range N number of times is
## equivalent to multiplying its assigned weight by N:
times <- sample(0:10L, length(x), replace=TRUE)
stopifnot(all.equal(coverage(rep(x, times), weight=rep(weight, times)),
                    coverage(x, weight=weight * times)))

## In particular, if 'weight' is not supplied:
stopifnot(identical(coverage(rep(x, times)), coverage(x, weight=times)))

## PROPERTY 6: If none of the input range actually gets clipped during
## the "shift and clip" process, then:
##
##     sum(cvg) = sum(width(x) * weight)
##
stopifnot(sum(cvg3) == sum(width(x) * weight))

## In particular, if 'weight' is not supplied:
stopifnot(sum(cvg0) == sum(width(x)))

## Note that this property is sometimes used in the context of a
## ChIP-Seq analysis to estimate "the number of reads in a peak", that
## is, the number of short reads that belong to a peak in the coverage
## vector computed from the genomic locations (a.k.a. genomic ranges)
## of the aligned reads. Because of property 6, the number of reads in
## a peak is approximately the area under the peak divided by the short
## read length.

## PROPERTY 7: If 'weight' is not supplied, then disjoining or reducing
## the ranges before calling coverage() has the effect of "shaving" the
## coverage vector at elevation 1:
table(cvg0)
shaved_cvg0 <- cvg0
runValue(shaved_cvg0) <- pmin(runValue(cvg0), 1L)
table(shaved_cvg0)

stopifnot(identical(coverage(disjoin(x)), shaved_cvg0))
stopifnot(identical(coverage(reduce(x)), shaved_cvg0))

## ---------------------------------------------------------------------
## F. SOME SANITY CHECKS
## ---------------------------------------------------------------------
dummy_coverage <- function(x, shift=0L, width=NULL)
{
    y <- IRanges:::unlist_as_integer(shift(x, shift))
    if (is.null(width))
        width <- max(c(0L, y))
    Rle(tabulate(y,  nbins=width))
}

check_real_vs_dummy <- function(x, shift=0L, width=NULL)
{
    res1 <- coverage(x, shift=shift, width=width)
    res2 <- dummy_coverage(x, shift=shift, width=width)
    stopifnot(identical(res1, res2))
}
check_real_vs_dummy(x)
check_real_vs_dummy(x, shift=7)
check_real_vs_dummy(x, shift=7, width=27)
check_real_vs_dummy(x, shift=c(-4, 2))
check_real_vs_dummy(x, shift=c(-4, 2), width=12)
check_real_vs_dummy(x, shift=-max(end(x)))

## With a set of distinct single positions:
x3 <- IRanges(sample(50000, 20000), width=1)
stopifnot(identical(sort(start(x3)), which(coverage(x3) != 0L)))

Bioconductor/IRanges documentation built on Feb. 16, 2025, 2:19 p.m.