knitr::opts_chunk$set( collapse = TRUE, warning = FALSE, message = FALSE, comment = "#>" )
A common application of single-cell RNA sequencing (scRNA-seq) data is
to identify discrete cell types. To take advantage of the large collection
of well-annotated scRNA-seq datasets, the scClassify package implements
a set of methods for accurate cell type classification based on
ensemble learning and sample size calculation.
This vignette demonstrates the usage of scClassify,
providing a concise description of each method with working examples.
First, install scClassify via BiocManager.
# installation of scClassify
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("scClassify")
We assume that you have log-transformed (size-factor normalized) matrices where each row is a gene and each column a cell for a reference dataset and a query dataset. For demonstration purposes, we will take a subset of single-cell pancreas datasets from two independent studies (Wang et al., and Xin et al.).
library("scClassify")
data("scClassify_example")
xin_cellTypes <- scClassify_example$xin_cellTypes
exprsMat_xin_subset <- scClassify_example$exprsMat_xin_subset
wang_cellTypes <- scClassify_example$wang_cellTypes
exprsMat_wang_subset <- scClassify_example$exprsMat_wang_subset
exprsMat_xin_subset <- as(exprsMat_xin_subset, "dgCMatrix")
exprsMat_wang_subset <- as(exprsMat_wang_subset, "dgCMatrix")
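If you are starting from raw counts rather than the example data, a matrix in the expected gene-by-cell layout can be prepared along the following lines. This is a minimal sketch in base R, assuming simple library-size scaling as the size-factor estimate and a log2(x + 1) transform; the toy matrix and the names counts, size_factors and logcounts are illustrative and are not part of scClassify.

```r
# Toy raw count matrix: rows are genes, columns are cells (illustrative data).
counts <- matrix(
  c(10, 0, 3,
    20, 5, 0,
     0, 8, 6,
     4, 2, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3))
)

# Library-size factors, scaled to average 1 (an assumed, simple choice).
lib_sizes <- colSums(counts)
size_factors <- lib_sizes / mean(lib_sizes)

# Divide each cell (column) by its size factor, then log-transform.
logcounts <- log2(sweep(counts, 2, size_factors, "/") + 1)

dim(logcounts)
```

Other normalization methods may suit your data better; the point here is only the orientation (genes in rows, cells in columns) and the log scale.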
The original cell type annotations and compositions of the example datasets can be easily accessed as shown below.
table(xin_cellTypes)
table(wang_cellTypes)
We can see that the Xin et al. data contain only 4 cell types, while the Wang et al. data contain 7.
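When the query has more cell types than the reference, it helps to list the labels that have no counterpart in the reference, since those cells cannot be assigned a matching reference label. A small sketch with made-up label vectors standing in for xin_cellTypes and wang_cellTypes (the labels below are hypothetical, not the contents of the example data):

```r
# Hypothetical label vectors; replace with xin_cellTypes and wang_cellTypes.
ref_labels <- c("alpha", "beta", "beta", "delta")
query_labels <- c("alpha", "beta", "ductal", "ductal", "delta")

# Cell types present in the query but absent from the reference.
missing_in_ref <- setdiff(unique(query_labels), unique(ref_labels))
missing_in_ref
```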
We first perform non-ensemble scClassify, using the Xin et al. data
as our reference dataset and the Wang et al. data as our query dataset.
We use WKNN as the KNN algorithm, DE (differentially expressed genes,
specified as selectFeatures = "limma") as the gene selection method,
and pearson as the similarity metric.
scClassify_res <- scClassify(exprsMat_train = exprsMat_xin_subset,
                             cellTypes_train = xin_cellTypes,
                             exprsMat_test = list(wang = exprsMat_wang_subset),
                             cellTypes_test = list(wang = wang_cellTypes),
                             tree = "HOPACH",
                             algorithm = "WKNN",
                             selectFeatures = c("limma"),
                             similarity = c("pearson"),
                             returnList = FALSE,
                             verbose = FALSE)
We can check the cell type tree built from the reference data:
scClassify_res$trainRes
plotCellTypeTree(cellTypeTree(scClassify_res$trainRes))
Note that scClassify_res$trainRes is an object of class scClassifyTrainModel.
Check the prediction results.
table(scClassify_res$testRes$wang$pearson_WKNN_limma$predRes, wang_cellTypes)
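A cross-tabulation like the one above can be summarized as an overall accuracy and a per-type recall. The sketch below uses toy label vectors so that it runs on its own; in practice pred would be the predRes vector and truth would be wang_cellTypes. The "unassigned" label is simply a placeholder here for a cell that received no reference label.

```r
# Toy true and predicted labels (illustrative only).
truth <- c("alpha", "alpha", "beta", "beta", "delta", "ductal")
pred  <- c("alpha", "alpha", "beta", "alpha", "delta", "unassigned")

# Confusion table: predicted labels in rows, true labels in columns.
conf_mat <- table(predicted = pred, truth = truth)

# Overall proportion of cells whose prediction matches the annotation.
accuracy <- mean(pred == truth)

# Proportion of correctly predicted cells within each true cell type.
per_type <- tapply(pred == truth, truth, mean)

conf_mat
accuracy
per_type
```

Cell types absent from the reference will always score zero under this exact-match measure, so read the per-type values alongside the reference composition.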
We next perform ensemble scClassify, again using the Xin et al. data
as our reference dataset and the Wang et al. data as our query dataset.
We use WKNN as the KNN algorithm, DE (selectFeatures = "limma") as the
gene selection method, and both pearson and cosine as the similarity metrics.
This gives two combinations of gene selection method and
similarity metric as base classifiers:
WKNN + DE + pearson
WKNN + DE + cosine
Here, we weight these two classifiers equally by setting
weighted_ensemble = FALSE. By default this argument is TRUE,
in which case each base classifier is weighted by its accuracy
on the reference data.
scClassify_res_ensemble <- scClassify(exprsMat_train = exprsMat_xin_subset,
                                      cellTypes_train = xin_cellTypes,
                                      exprsMat_test = list(wang = exprsMat_wang_subset),
                                      cellTypes_test = list(wang = wang_cellTypes),
                                      tree = "HOPACH",
                                      algorithm = "WKNN",
                                      selectFeatures = c("limma"),
                                      similarity = c("pearson", "cosine"),
                                      weighted_ensemble = FALSE,
                                      returnList = FALSE,
                                      verbose = FALSE)
We can compare the predictions of the two base classifiers as below.
table(scClassify_res_ensemble$testRes$wang$pearson_WKNN_limma$predRes,
      scClassify_res_ensemble$testRes$wang$cosine_WKNN_limma$predRes)
Now, check the final ensemble results:
table(scClassify_res_ensemble$testRes$wang$ensembleRes$cellTypes, wang_cellTypes)
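To make the difference between an equally weighted and an accuracy-weighted ensemble concrete, here is a generic weighted-voting sketch for a single cell. It illustrates the general idea only and is not scClassify's internal implementation; the function ensemble_vote, the three classifiers and the weights are all hypothetical.

```r
# Pick the label with the largest total weight across base classifiers.
# `votes` is a character vector with one predicted label per classifier;
# `weights` defaults to equal weighting.
ensemble_vote <- function(votes, weights = rep(1, length(votes))) {
  scores <- tapply(weights, votes, sum)
  names(scores)[which.max(scores)]
}

# Hypothetical predictions for one cell from three base classifiers.
votes <- c(clf1 = "beta", clf2 = "alpha", clf3 = "alpha")

# Equal weights: the majority label wins.
equal_call <- ensemble_vote(votes)

# Hypothetical accuracy weights: one strong classifier can outvote two weak ones.
weighted_call <- ensemble_vote(votes, weights = c(0.9, 0.3, 0.4))

equal_call
weighted_call
```

With only two base classifiers, as in the example above, an equal-weight vote can tie whenever they disagree; this sketch breaks ties by taking the first label in alphabetical order, which is an arbitrary choice.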
You can also train your own scClassifyTrainModel using train_scClassify().
Note that by setting weightsCal = TRUE, the training error on the
reference data is calculated and used as the weights for the
individual classifiers.
Here, we illustrate the training function with gene selection methods based on differential expression ("limma") and bimodal distribution ("BI").
trainClass <- train_scClassify(exprsMat_train = exprsMat_xin_subset,
                               cellTypes_train = xin_cellTypes,
                               selectFeatures = c("limma", "BI"),
                               returnList = FALSE)
trainClass
sessionInfo()