View source: R/classifySingleR.R
classifySingleR | R Documentation |
Assign labels to each cell in a test dataset, using a pre-trained classifier combined with an iterative fine-tuning approach.
classifySingleR(
test,
trained,
quantile = 0.8,
fine.tune = TRUE,
tune.thresh = 0.05,
sd.thresh = NULL,
prune = TRUE,
assay.type = "logcounts",
check.missing = FALSE,
num.threads = bpnworkers(BPPARAM),
BPPARAM = SerialParam()
)
test |
A numeric matrix of single-cell expression values where rows are genes and columns are cells. Each row should be named with the gene name. Alternatively, a SummarizedExperiment object containing such a matrix. |
trained |
A List containing the output of the Alternatively, a List of Lists produced by |
quantile |
A numeric scalar specifying the quantile of the correlation distribution to use to compute the score for each label. |
fine.tune |
A logical scalar indicating whether fine-tuning should be performed. |
tune.thresh |
A numeric scalar specifying the maximum difference from the maximum correlation to use in fine-tuning. |
sd.thresh |
Deprecated and ignored. |
prune |
A logical scalar indicating whether label pruning should be performed. |
assay.type |
Integer scalar or string specifying the matrix of expression values to use if |
check.missing |
Deprecated and ignored, as any row filtering will cause mismatches with the |
num.threads |
Integer scalar specifying the number of threads to use for classification. |
BPPARAM |
A BiocParallelParam object specifying the parallelization scheme to use for |
Consider each cell in the test set test
and each label in the training set.
We compute Spearman's rank correlations between the test cell and all cells in the training set with the given label,
based on the expression profiles of the genes selected by trained
.
The score is defined as the quantile of the distribution of correlations, as specified by quantile
.
(Technically, we avoid explicitly computing all correlations by using a nearest neighbor search, but the resulting score is the same.)
After repeating this across all labels, the label with the highest score is used as the prediction for that cell.
If fine.tune=TRUE
, an additional fine-tuning step is performed for each cell to improve resolution.
We identify all labels with scores that are no more than tune.thresh
below the maximum score.
These labels are used to identify a fresh set of marker genes, and the calculation of the score is repeated using only these genes.
The aim is to refine the choice of markers and reduce noise when distinguishing between closely related labels.
The best and next-best scores are reported in the output for use in diagnostics, e.g., pruneScores
.
The default assay.type
is set to "logcounts"
simply for consistency with trainSingleR
.
In practice, the raw counts (for UMI data) or the transcript counts (for read count data) can also be used without normalization and log-transformation.
Any monotonic transformation will have no effect the calculation of the correlation values other than for some minor differences due to numerical precision.
If prune=TRUE
, label pruning is performed as described in pruneScores
with default arguments.
This aims to remove low-quality labels that are ambiguous or correspond to misassigned cells.
However, the default settings can be somewhat aggressive and discard otherwise useful labels in some cases - see ?pruneScores
for details.
A DataFrame where each row corresponds to a cell in test
.
In the case of a single reference, this contains:
scores
, a numeric matrix of correlations at the specified quantile
for each label (column) in each cell (row).
This will contain NA
s if multiple references were supplied to trainSingleR
.
labels
, a character vector containing the predicted label.
If fine.tune=FALSE
, this is based only on the maximum entry in scores
.
delta.next
, a numeric vector containing the difference between tbe best and next-best score.
If fine.tune=TRUE
, this is reported for scores after fine-tuning.
pruned.labels
, a character vector containing the pruned labels where “low-quality” labels are replaced with NA
s.
Only added if prune=TRUE
.
The metadata
of the DataFrame contains:
common.genes
, a character vector of genes used to compute the correlations prior to fine-tuning.
de.genes
, a list of list of character vectors, containing the genes used to distinguish between each pair of labels.
If trained
was generated from multiple references,
the per-reference statistics are automatically combined into a single DataFrame of results using combineRecomputedResults
.
The output of combineRecomputedResults
is then directly returned.
Aaron Lun, based on the original SingleR
code by Dvir Aran.
trainSingleR
, to prepare the training set for classification.
pruneScores
, to remove low-quality labels based on the scores.
combineCommonResults
, to combine results from multiple references.
# Mocking up data with log-normalized expression values:
ref <- .mockRefData()
test <- .mockTestData(ref)
ref <- scuttle::logNormCounts(ref)
test <- scuttle::logNormCounts(test)
# Setting up the training:
trained <- trainSingleR(ref, label=ref$label)
# Performing the classification:
pred <- classifySingleR(test, trained)
table(predicted=pred$labels, truth=test$label)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.