knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
\vspace{.1in}
Cell clustering is one of the most important and commonly performed tasks in single-cell RNA sequencing (scRNA-seq) data analysis. An important step in cell clustering is to select a subset of genes (referred to as “features”), whose expression patterns will then be used for downstream clustering. A good set of features should include the ones that distinguish different cell types, and the quality of such set could have significant impact on the clustering accuracy. All existing scRNA-seq clustering tools include a feature selection step relying on some simple unsupervised feature selection methods, mostly based on the statistical moments of gene-wise expression distributions. In this work, we develop a feature selection algorithm named FEAture SelecTion (FEAST), which provides more representative features.
suppressMessages(library(FEAST))
data(Yan) k = length(unique(trueclass)) Y = process_Y(Y, thre = 2) ixs = FEAST(Y, k=k) # look at the features Ynorm = Norm_Y(Y) par(mfrow = c(3,3)) for (i in 1:9){ tmp_ix = ixs[i] tmp_gene = rownames(Ynorm)[tmp_ix] boxplot(as.numeric(Ynorm[tmp_ix, ])~trueclass, main = tmp_gene, xlab="", ylab="", las=2) }
FEAST requires a count expression matrix with rows representing the genes and columns representing the cells. To preprocess the count matrix, FEAST filters out the genes depending on the dropout rates (sometimes known as zero-expression rates). Here, we use one template Deng
[@deng2014single] dataset. It is important to note that FEAST requires the number of clusters (k) as an input, which can be obtained by the prior knowledge of the cell types.
The Deng
dataset is properly stored as a SingleCellExperiment object (Deng). The user can access the phenotype information by using colData(Deng)
function, and the count expression matrix by using assay(Deng,"counts")
function from the SummarizedExperiment Bioconductor package. The sample data can be downloaded at https://drive.google.com/drive/u/0/folders/1SRT7mrX7ziJoSjuFLLkK8kjnUsJrabVM.
load("Pathto/Deng.RData") Deng # load the Deng dataset, which includes 6 cell types (268 cells). trueclass = colData(Deng)$cellTypes k = length(unique(trueclass)) Y = assays(Deng)$counts
To preprocess the count expression matrix (Y
), FEAST filters out the genes based on the dropout rate (sometimes known as zero-expression rate). This consensus clustering step will output the initial clustering labels based on a modified CSPA algorithm. It will find the cells that tightly clustered together and possibly filter out the cells that are less correlated to the initial clustering centers.
#Y = process_Y(Y, thre = 2) # preprocess the data if needed con_res = Consensus(Y, k=k)
After the consensus clustering step, FEAST will generate the initial clustering labels with confidence. Based on the initial clustering outcomes, FEAST further infers the gene-level significance by using F-statistics followed by the ranking of the full list of genes.
F_res = cal_F2(Y, con_res$cluster) ixs = order(F_res$F_scores, decreasing = TRUE) # order the features
After ranking the genes by F-statistics, FEAST curates a series of top number of features (genes). Then, FEAST input these top (by default top = 500, 1000, 2000
) features into some established clustering algorithm (e.g. SC3 [@kiselev2017sc3]). It is worth noting that one needs to confirm the k to be the same for every iteration.
After the clustering steps, with the case of unknown cell types, FEAST evaluates the quality of the feature set by computing the average distance between each cell and the clustering centers, which is the same as the MSE. It is noting that FEAST uses all the features (genes) for distance calculation for the purpose of fair comparisons. The clustering centers are obtained by using previous clustering step.
## clustering step tops = c(500, 1000, 2000) cluster_res = NULL for (top in tops){ tmp_ixs = ixs[1:top] tmp_markers = rownames(Y)[tmp_ixs] tmp_res = SC3_Clust(Y, k = k, input_markers = tmp_markers) cluster_res[[toString(top)]] = tmp_res } ## validation step Ynorm = Norm_Y(Y) mse_res = NULL for (top in names(cluster_res)){ tmp_res = cluster_res[[top]] tmp_cluster = tmp_res$cluster tmp_mse = cal_MSE(Ynorm = Ynorm, cluster = tmp_cluster) mse_res = c(mse_res, tmp_mse) } names(mse_res) = names(cluster_res)
After the validation step, the feature set associated with the smallest MSE value will be recommended for the further analysis. Here, we demonstrate that an improved clustering accuracy by specifying the optimal feature set. The benchmark comparison is the original setting of established clustering algorithm (e.g. SC3).
SC3_original = SC3_Clust(Y, k=k) id = which.min(mse_res) eval_Cluster(SC3_original$cluster, trueclass) eval_Cluster(cluster_res[[id]]$cluster, trueclass)
SC3_original = SC3_Clust(Y, k=k) id = which.min(mse_res) eval_Cluster(SC3_original$cluster, trueclass) eval_Cluster(cluster_res[[id]]$cluster, trueclass)
As demonstrated in the figure below, the first panel includes the PCA illustration of the cell types. The second panel shows the clustering result by the original SC3, in which some cells (inside the gray circle) are mixed. The third panel shows the improved clustering outcomes by inputing the optimized feature set into SC3.
{ width="600"}
In the second scenario, the numbers of cells for distinct cell types can be very different, so the power for DE highly depends on the mixing proportions.
Y = assays(Deng)$counts trueclass = Deng$cellTypes k = length(unique(trueclass)) Y = process_Y(Y, thre = 2) # preprocess the data con_res = Consensus(Y, k=k) mod_res = Select_Model_short_SC3(Y, cluster = con_res$cluster, top = c(200, 500, 1000, 2000)) # to visualize the result, one needs to load ggpubr library. library(ggpubr) Visual_Rslt(model_cv_res = mod_res, trueclass = trueclass)
The result figure looks like the following:
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.