classifySingleR: Classify cells with SingleR

View source: R/classifySingleR.R

classifySingleRR Documentation

Classify cells with SingleR

Description

Assign labels to each cell in a test dataset, using a pre-trained classifier combined with an iterative fine-tuning approach.

Usage

classifySingleR(
  test,
  trained,
  quantile = 0.8,
  fine.tune = TRUE,
  tune.thresh = 0.05,
  sd.thresh = NULL,
  prune = TRUE,
  assay.type = "logcounts",
  check.missing = FALSE,
  num.threads = bpnworkers(BPPARAM),
  BPPARAM = SerialParam()
)

Arguments

test

A numeric matrix of single-cell expression values where rows are genes and columns are cells. Each row should be named with the gene name.

Alternatively, a SummarizedExperiment object containing such a matrix.

trained

A List containing the output of the trainSingleR function. If the row names of test are not exactly the same as the reference dataset, the call to trainSingleR should set test.genes=rownames(test).

Alternatively, a List of Lists produced by trainSingleR for multiple references.

quantile

A numeric scalar specifying the quantile of the correlation distribution to use to compute the score for each label.

fine.tune

A logical scalar indicating whether fine-tuning should be performed.

tune.thresh

A numeric scalar specifying the maximum difference from the maximum correlation to use in fine-tuning.

sd.thresh

Deprecated and ignored.

prune

A logical scalar indicating whether label pruning should be performed.

assay.type

Integer scalar or string specifying the matrix of expression values to use if test is a SummarizedExperiment.

check.missing

Deprecated and ignored, as any row filtering will cause mismatches with the test.genes= used in trainSingleR.

num.threads

Integer scalar specifying the number of threads to use for classification.

BPPARAM

A BiocParallelParam object specifying the parallelization scheme to use for NA scanning, when check.missing=TRUE.

Details

Consider each cell in the test set test and each label in the training set. We compute Spearman's rank correlations between the test cell and all cells in the training set with the given label, based on the expression profiles of the genes selected by trained. The score is defined as the quantile of the distribution of correlations, as specified by quantile. (Technically, we avoid explicitly computing all correlations by using a nearest neighbor search, but the resulting score is the same.) After repeating this across all labels, the label with the highest score is used as the prediction for that cell.

If fine.tune=TRUE, an additional fine-tuning step is performed for each cell to improve resolution. We identify all labels with scores that are no more than tune.thresh below the maximum score. These labels are used to identify a fresh set of marker genes, and the calculation of the score is repeated using only these genes. The aim is to refine the choice of markers and reduce noise when distinguishing between closely related labels. The best and next-best scores are reported in the output for use in diagnostics, e.g., pruneScores.

The default assay.type is set to "logcounts" simply for consistency with trainSingleR. In practice, the raw counts (for UMI data) or the transcript counts (for read count data) can also be used without normalization and log-transformation. Any monotonic transformation will have no effect the calculation of the correlation values other than for some minor differences due to numerical precision.

If prune=TRUE, label pruning is performed as described in pruneScores with default arguments. This aims to remove low-quality labels that are ambiguous or correspond to misassigned cells. However, the default settings can be somewhat aggressive and discard otherwise useful labels in some cases - see ?pruneScores for details.

Value

A DataFrame where each row corresponds to a cell in test. In the case of a single reference, this contains:

  • scores, a numeric matrix of correlations at the specified quantile for each label (column) in each cell (row). This will contain NAs if multiple references were supplied to trainSingleR.

  • labels, a character vector containing the predicted label. If fine.tune=FALSE, this is based only on the maximum entry in scores.

  • delta.next, a numeric vector containing the difference between tbe best and next-best score. If fine.tune=TRUE, this is reported for scores after fine-tuning.

  • pruned.labels, a character vector containing the pruned labels where “low-quality” labels are replaced with NAs. Only added if prune=TRUE.

The metadata of the DataFrame contains:

  • common.genes, a character vector of genes used to compute the correlations prior to fine-tuning.

  • de.genes, a list of list of character vectors, containing the genes used to distinguish between each pair of labels.

If trained was generated from multiple references, the per-reference statistics are automatically combined into a single DataFrame of results using combineRecomputedResults. The output of combineRecomputedResults is then directly returned.

Author(s)

Aaron Lun, based on the original SingleR code by Dvir Aran.

See Also

trainSingleR, to prepare the training set for classification.

pruneScores, to remove low-quality labels based on the scores.

combineCommonResults, to combine results from multiple references.

Examples

# Mocking up data with log-normalized expression values:
ref <- .mockRefData()
test <- .mockTestData(ref)

ref <- scuttle::logNormCounts(ref)
test <- scuttle::logNormCounts(test)

# Setting up the training:
trained <- trainSingleR(ref, label=ref$label)

# Performing the classification:
pred <- classifySingleR(test, trained)
table(predicted=pred$labels, truth=test$label)


LTLA/SingleR documentation built on Nov. 2, 2024, 10:38 a.m.