cleanTagCounts: Clean a tag-based dataset

cleanTagCounts    R Documentation

Clean a tag-based dataset

Description

Remove low-quality libraries from a count matrix where each row is a tag and each column corresponds to a cell-containing barcode.

Usage

cleanTagCounts(x, ...)

## S4 method for signature 'ANY'
cleanTagCounts(
  x,
  controls,
  ...,
  ambient = NULL,
  exclusive = NULL,
  sparse.prop = 0.5
)

## S4 method for signature 'SummarizedExperiment'
cleanTagCounts(x, ..., assay.type = "counts")

Arguments

x

A numeric matrix-like object containing counts for each tag (row) in each cell (column). Alternatively, a SummarizedExperiment containing such a matrix.

...

For the generic, further arguments to pass to individual methods.

For the SummarizedExperiment method, further arguments to pass to the ANY method.

For the ANY method, further arguments to pass to isOutlier. This includes batch to account for multi-batch experiments, and nmads to specify the stringency of the outlier-based filter.

controls

A vector specifying the rows of x corresponding to control tags. These are expected to be isotype controls that should not exhibit any real binding.

ambient

A numeric vector of length equal to nrow(x), containing the relative concentration of each tag in the ambient solution. Defaults to ambientProfileBimodal(x) if not explicitly provided.

exclusive

A character vector of names of mutually exclusive tags that should never be expressed on the same cell. Alternatively, a list of vectors of mutually exclusive sets of tags - see ambientContribNegative for details.

sparse.prop

Numeric scalar specifying the minimum proportion of tags that should be present per cell.

assay.type

Integer or string specifying the assay containing the count matrix.

Details

We remove cells for which there is no detectable ambient contamination. Specifically, we expect non-zero counts for most tags in every cell, because tag-based data is deeply sequenced. If a proportion of sparse.prop or more of the tags have zero counts in a cell, this indicates a failure in library preparation for that cell.
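The sparsity check described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's internal code: the toy matrix `counts` and the exact comparison (`>=`) are hypothetical and simply follow the wording of this paragraph.

```r
# Toy count matrix: 4 tags (rows) by 3 cells (columns).
counts <- cbind(
    cell1 = c(120, 35, 60, 15),  # all tags detected
    cell2 = c(0, 0, 0, 40),      # 3 of 4 tags are zero
    cell3 = c(10, 0, 25, 0)      # 2 of 4 tags are zero
)

sparse.prop <- 0.5

# Proportion of tags with zero counts in each cell.
prop.zero <- colMeans(counts == 0)

# Assumed rule: flag cells where this proportion is sparse.prop or more.
zero.ambient <- prop.zero >= sparse.prop
zero.ambient
```

In this toy data, cell2 and cell3 would be flagged, while cell1 has counts for every tag and is retained.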

We also remove cells for which the total control count is unusually high. The control coverage is used as a proxy for non-specific binding, most notably from contamination of droplets by protein aggregates. High levels of non-specific activity are undesirable as this masks the actual marker profile of affected cells. The upper threshold is defined with isOutlier on the log-total control count.
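To illustrate the idea of an upper threshold on the log-total control count, here is a minimal base R sketch of a median-plus-MAD rule. It is an approximation of what an outlier-based filter does, not a reproduction of isOutlier itself; the values in `sum.controls` are made up and `nmads = 3` is assumed.

```r
# Hypothetical total control counts for 8 cells; the last cell is contaminated.
sum.controls <- c(12, 15, 9, 14, 11, 13, 10, 400)

# Work on the log scale so that the threshold reflects fold-changes.
log.total <- log2(sum.controls)

# Upper threshold: median plus a number of MADs (nmads = 3 is assumed here).
nmads <- 3
upper <- median(log.total) + nmads * mad(log.total)

# Cells above the threshold are treated as having unusually high control counts.
high.controls <- log.total > upper
which(high.controls)
```

Only the last cell exceeds the threshold in this example. In practice the filtering is delegated to isOutlier, which also handles options such as batch.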

If controls is missing, we instead compute the ambient scaling factor for each cell. This represents the amount of ambient contamination - see ?ambientContribSparse for more details - and cells with unusually high values are assumed to be affected by protein aggregates. High outliers are again identified and removed based on the log-ambient scale.

If controls is missing and exclusive is specified, the ambient scaling factor is computed by ambientContribNegative instead. This can be helpful for explicitly removing cells with impossible marker combinations, though it is only as comprehensive as the knowledge of mutually exclusive marker sets.

Value

A DataFrame with one row per column of x, containing the following fields:

  • zero.ambient, a logical field indicating whether each cell has zero ambient contamination.

  • sum.controls, a numeric field containing the sum of counts for all control features. Only present if controls is supplied.

  • high.controls, a logical field indicating whether each cell has an unusually high control total. Only present if controls is supplied.

  • ambient.scale, a numeric field specifying the relative amount of ambient contamination. Only present if controls is not supplied.

  • high.ambient, a logical field indicating whether each cell has unusually high ambient contamination. Only present if controls is not supplied.

  • discard, a logical field indicating whether a column in x should be discarded.
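A typical use of these fields is to subset the count matrix to the retained cells. The base R sketch below uses a plain data.frame named `qc` as a stand-in for the returned DataFrame, with made-up flag values. The rule used to combine the flags into `discard` is an assumption made for illustration.

```r
# Toy matrix: 3 tags by 4 cells.
x <- matrix(1:12, nrow = 3,
    dimnames = list(paste0("tag", 1:3), paste0("cell", 1:4)))

# Stand-in for the returned per-cell flags (hypothetical values).
qc <- data.frame(
    zero.ambient = c(FALSE, TRUE, FALSE, FALSE),
    high.controls = c(FALSE, FALSE, FALSE, TRUE),
    row.names = colnames(x)
)

# Assumed rule: discard a cell if any individual filter flags it.
qc$discard <- qc$zero.ambient | qc$high.controls

# Keep only the cells that passed all filters.
filtered <- x[, !qc$discard, drop = FALSE]
colnames(filtered)
```

With the actual output of cleanTagCounts, the same subsetting pattern applies using the discard field directly.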

Author(s)

Aaron Lun

See Also

ambientContribSparse, to estimate the ambient contamination for each droplet.

isOutlier, to identify the outliers in a distribution of values.

Examples

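library(DropletUtils)

# Simulate counts for three tags across 1000 cells, where each
# tag is highly abundant in a different subset of cells.
x <- rbind(
    rpois(1000, rep(c(100, 10), c(100, 900))),
    rpois(1000, rep(c(20, 100, 20), c(100, 100, 800))),
    rpois(1000, rep(c(30, 100, 30), c(200, 700, 100)))
)

# Adding a zero-ambient column plus a high-ambient column.
x <- cbind(0, x, 1000)

# 'controls' is not supplied, so the ambient scaling factor is used instead.
df <- cleanTagCounts(x)
df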


MarioniLab/DropletUtils documentation built on Oct. 12, 2024, 5:40 p.m.