library("knitr") opts_chunk$set(fig.align = 'center', fig.width = 6, fig.height = 5, dev = 'png') op <- options(gvis.plot.tag='chart')
The number of cell atlases is growing rapidly, as a result of advances in single-cell sequencing techniques and cheaper sequencing cost. To search through these large single-cell datasets for analysis, however, is time-consuming and inefficient because these types of dataset usually take up large amounts of memory. scfind
has adopted an efficient compression strategy which makes it suitable for real-time queries of millions of cells.
scfind
is a method for searching specific cell types from large single-cell datasets by a query of gene list, in which scfind
can suggest subquries score by TF-IDF method. scfind
can perform hypergeometric test which allows the evaluation of marker genes specific to each cell type within a dataset. A manuscript describing scfind
in details is available in bioRxiv
SingleCellExperiment
classIf you already have an SingleCellExperiment
object, then proceed to the next chapter.
scfind
is built on top of the Bioconductor’s SingleCellExperiment class. scfind
operates on objects of class SingleCellExperiment
and writes all of its results back to an object. Please read corresponding vignettes on how to create a SingleCellExperiment
from your own data. For illustrative purposes, a list of SingleCellExperiment
objects of the Tabula Muris (FACS)
and the Tabula Muris (10X)
datasets are provided. The investigators (The Tabula Muris Consortium) have profiled almost every cell-type in the mouse using high-coverage FACS-sorted cells + Smartseq2. in the original publication. We will combine the indices of 18 tissues into a super-index.
library("scfind") suppressPackageStartupMessages(library("SingleCellExperiment")) # List of `Tabula Muris (FACS)` `SingleCellExperiment` objects data(tmfacs) # List of `Tabula Muris (10X)` `SingleCellExperiment` objects data(tm10x) # Download and load `SingleCellExperiment` object of the `Heart` dataset sce.heart <- readRDS(url(tmfacs["Heart"])) sce.heart colData(sce.heart)
scfind
InputOnce we have a SingleCellExperiment
object, we can run scfind
. Firstly, we need to build the scfind
index from our input dataset.
By default scfind
uses the cell_type1
column of the colData
slot in the reference to identify cell type names.
scfind.index <- buildCellTypeIndex(sce = sce.heart, cell.type.label = "cell_type1", # If you have your own clustering result or manually defined groups, you may change this parameters with the corresponding column name of the `colData` dataset.name = "Heart", assay.name = "counts") # if the dataset contains more than one category, users can build index for each tissue individually and merge all indices into 1 super-index using `mergeDataset` as following: sce.thymus <- readRDS(url(tmfacs["Thymus"])) scfind.index.new <- buildCellTypeIndex(sce = sce.thymus, cell.type.label = "cell_type1", # If you have your own clustering result or manually defined groups, you may change this parameters with the corresponding column name of the `colData` dataset.name = "Thymus", assay.name = "counts") merged.index <- mergeDataset(scfind.index, scfind.index.new) merged.index
# The `scfind` index can be saved as a RDS object saveObject(object = merged.index, file = "scfindIndex.rds")
merged.index@index$getCellTypeExpression("Thymus.T cell")[1:5, 1:5]
Once the scfind
index is built, one can view all existing genes in the datasets using scfindGenes
sample(scfindGenes(merged.index), 20)
or view all existing cell type names in the database using cellTypeNames
cellTypeNames(merged.index) # To specify the dataset cellTypeNames(merged.index, "Thymus")
Once we have a scfind
index, we can find the cell types that most likely represent your gene set from your dataset very quickly:
For illustrative purposes, we will use a preprocessed scfind index of the Tabula Muris (FACS)
.
# `scfind` indexes of the `Tabula Muris (FACS)` & `Tabula Muris (10X)` datasets data(ExampleIndex) geneIndex <- loadObject(file = url(ExampleIndex["TabulaMurisFACS"])) # try `geneIndex <- loadObject(file = url(ExampleIndex["TabulaMuris10X"]))` for another Tabula Muris Dataset geneIndex@datasets
scfind
is a search engine for single cell datasets. The central operation carried out by scfind
is to identify the set of cells that express a set of genes or peaks (i.e. the query) specified by the user.
query <- c("Il2ra", "Ptprc", "Il7r", "Ctla4") # The p-values is calculated by hypergeometric test hyperQueryCellTypes(object = geneIndex, gene.list = query) # Use the `datasets` argument to specify the datasets result <- hyperQueryCellTypes(object = geneIndex, gene.list = query, datasets = c("Marrow", "Thymus", "Fat")) # The calculation shows that the query gene set is specific for the `Marrow.T cell` cell type with the lowest p-value. barplot(-log10(result$pval), ylab = "-log10(pval)", las = 2, names.arg = result$cell_type)
To allow search of enriched cell type from a long list of gene query, scfind
features a query optimization routin. First, the function markerGenes
will counter suggest subqueries with the highest support in the dataset. The TF-IDF score for each gene set allows user to identify the best subquery for finding the most relevant cell type. We will use the marker genes identified in an original publication Yanbin et al. 2015. Cardiomyocyte-specific markers used in immunostaining.
long.query <- c("Mef2c", "Gata4", "Nkx2.5", "Myh6", "tnnt2", "tnni3", "CDH2", "Cx43", "GJA1") # In which, `scfind` will suggest subqueries for the initial gene sets along with number of cells and cell types that express the genes result <- markerGenes(object = geneIndex, gene.list = long.query) # Showing first 10 rows of the subqueries head(result, 10) subQueriesTfidf <- setNames(result$tfidf, gsub(",", "&", result$Query)) UpSetR::upset(UpSetR::fromExpression(subQueriesTfidf), mainbar.y.label = "TFIDF score", order.by = "freq", show.numbers = FALSE) subQueriesCells <- setNames(result$Cells, gsub(",", "&", result$Query)) UpSetR::upset(UpSetR::fromExpression(subQueriesCells), mainbar.y.label = "No. of cells", order.by = "freq", show.numbers = FALSE)
By ranking the tfidf score, the best subquery can be used to search for cell types that is enriched.
bestQuery <- strsplit(as.character(result[which.max(result$tfidf),'Query']), ",")[[1]] bestQuery # The calculation above shows that a list of genes containing `Myh6` and `Tnni3` is specific for the `Heart.cardiac muscle cell` cell type with the lowest p-value. enrichedCellTypes <- hyperQueryCellTypes(object = geneIndex, gene.list = bestQuery) # Showing first 10 rows of the enriched cell types head(enrichedCellTypes, 10)
To further evaluate a specific query by calculating the precision recall metrics
evaluateGenes <- evaluateMarkers(object = geneIndex, gene.list = bestQuery, cell.types = "Heart.cardiac muscle cell") evaluateGenes ggplot2::qplot(precision, recall, data = evaluateGenes,colour = f1, label = genes) + ggplot2::geom_text(vjust = 0, nudge_y = 5e-3, colour = "black", check_overlap = T)
If one is more interested in finding out which marker genes best represent a cell type in the dataset, cellTypeMarkers
function should be used for searching the index:
interestedCellType <- c("Kidney.leukocyte") findMarkers <- cellTypeMarkers(object = geneIndex, cell.types = interestedCellType) ggplot2::qplot(precision, recall, data = findMarkers,colour = f1, label = genes) + ggplot2::geom_text(vjust = 0, nudge_y = 5e-3, colour = "black", check_overlap = T) # You can also specify the background cell types findMarkers <- cellTypeMarkers(object = geneIndex, cell.types = interestedCellType, background.cell.types = cellTypeNames(object = geneIndex, datasets = "Kidney")) ggplot2::qplot(precision, recall, data = findMarkers,colour = f1, label = genes) + ggplot2::geom_text(vjust = 0, nudge_y = 5e-3, colour = "black", check_overlap = T)
scfind
keeps track on individual cells. This allows cell sorting by adding the logical operators ("*" as OR, "-" as NOT) in front of each gene name.
## To find cells that has the expression pattern of NOT expressing the gene `Il2ra` and OR expressing the gene `Ptprc` in the Thymus dataset findCellTypes(object = geneIndex, gene.list = c("-Il2ra", "*Ptprc", "Il7r"), datasets = "Thymus") hyperQueryCellTypes(object = geneIndex, gene.list = c("-Il2ra", "*Ptprc", "Il7r"), datasets = "Thymus") ## Without logical operators findCellTypes(object = geneIndex, gene.list = c("Il2ra", "Ptprc", "Il7r"), datasets = "Thymus") hyperQueryCellTypes(object = geneIndex, gene.list = c("Il2ra", "Ptprc", "Il7r"), datasets = "Thymus")
With the function mergeDataset
, users can build a super index which contains multiple datasets for analysis. The example in bioRxiv shows that scfind
is an ideal tool for multimodal analysis. Here, we are using the example super index to demonstrate how one can perform analysis on multi-datasets at the same time.
superIndex <- loadObject(url(ExampleIndex["TabulaMurisSuperIndex"])) ## Now you could view all tissues from both datasets here superIndex@datasets
## And view the cell types here sample(cellTypeNames(superIndex), 20)
Let's use our long query as example again to perform search with the super index.
long.query <- c("Mef2c", "Gata4", "Nkx2.5", "Myh6", "tnnt2", "tnni3", "CDH2", "Cx43", "GJA1") result.superIndex <- markerGenes(object = superIndex, gene.list = long.query) head(result.superIndex, 10) subQueriesTfidf.superIndex <- setNames(result.superIndex$tfidf, gsub(",", "&", result.superIndex$Query)) UpSetR::upset(UpSetR::fromExpression(subQueriesTfidf.superIndex), mainbar.y.label = "TFIDF score", order.by = "freq", show.numbers = FALSE) subQueriesCells.superIndex <- setNames(result.superIndex$Cells, gsub(",", "&", result.superIndex$Query)) UpSetR::upset(UpSetR::fromExpression(subQueriesCells.superIndex), mainbar.y.label = "No. of cells", order.by = "freq", show.numbers = FALSE)
To do analysis with multiple cell types, users are allow to select background datasets or cell types as following. For example, we can narrow down our analysis to one of the dataset TMFACS
only.
## Obtain all cell type names in `TM10X` select.dataset <- grep("TM10X-", superIndex@datasets, value = T) select.celltypes <- cellTypeNames(superIndex, select.dataset) ## Run query optimization routine result.superIndex.tm10x <- markerGenes(object = superIndex, gene.list = long.query, dataset = select.dataset) head(result.superIndex.tm10x, 10) subQueriesTfidf.superIndex.tm10x <- setNames(result.superIndex.tm10x$tfidf, gsub(",", "&", result.superIndex.tm10x$Query)) UpSetR::upset(UpSetR::fromExpression(subQueriesTfidf.superIndex.tm10x), mainbar.y.label = "TFIDF score", order.by = "freq", show.numbers = FALSE) subQueriesCells.superIndex.tm10x <- setNames(result.superIndex.tm10x$Cells, gsub(",", "&", result.superIndex.tm10x$Query)) UpSetR::upset(UpSetR::fromExpression(subQueriesCells.superIndex.tm10x), mainbar.y.label = "No. of cells", order.by = "freq", show.numbers = FALSE)
Here, with the super index, we can now evaluate the genes in both TM10X
and TMFACS
simultaneously.
## Obtain all cell type names in `TM10X` select.dataset <- grep("TM10X-", superIndex@datasets, value = T) select.celltypes <- cellTypeNames(superIndex, select.dataset) evaluate.genes.tm10x <- evaluateMarkers(object = superIndex, gene.list = long.query, cell.types = "TM10X-Heart.cardiac muscle cell", background.cell.types = select.celltypes) ## using all cells in `TMFACS` as background cell types ## Obtain all cell type names in `TMFACS` select.dataset <- grep("TMFACS-", superIndex@datasets, value = T) select.celltypes <- cellTypeNames(superIndex, select.dataset) evaluate.genes.tmfacs <- evaluateMarkers(object = superIndex, gene.list = long.query, cell.types = "TMFACS-Heart.cardiac muscle cell", background.cell.types = select.celltypes) ## using all cells in `TMFACS` as background cell types evaluate.result <- rbind( data.frame(evaluate.genes.tm10x, dataset = "TM10X"), data.frame(evaluate.genes.tmfacs, dataset = "TMFACS") ) ggplot2::qplot(precision, recall, data = evaluate.result,colour = dataset, size = f1, label = genes) + ggplot2::geom_text(vjust = 0, nudge_y = 5e-3, colour = "black", check_overlap = T)
Note scfind
can also be run in an interactive Shiny
session:
scfindShiny(object = geneIndex)
To enhancer user experience, the Hemberg-lab has an interactive website of 9 collections of scfind
indexes.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.