knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(geneBasisR) library(SingleCellExperiment) library(tibble) library(ggplot2) library(ggpubr)
geneBasis
is an iterative greedy approach for gene selection. It strives a gene selection that will preserve transcriptional relationships between cells, represented as kNN-graphs.
Here we exemplify how geneBasis
can be utilized for the design of targeted gene panels (just main steps; for the extended workflow, covering various aspects of gene selection process, please see tutorials here: https://github.com/MarioniLab/geneBasisR).
We will be working with mouse embryo, E8.5. Initial filtering for uninteresting genes was already performed.
data("sce_mouseEmbryo", package = "geneBasisR") # Fetch meta-data and umap coordinates, merge together meta = as.data.frame(colData(sce_mouseEmbryo)) umaps = as.data.frame(reducedDim(sce_mouseEmbryo, "UMAP")) umaps = rownames_to_column(umaps, var = "cell") meta = merge(meta, umaps) # Fetch colour scheme for celltypes from meta-data celltype_colors_df = unique(meta[, c("celltype" , "colour")]) celltype_colors = c(paste0("#",as.character(celltype_colors_df$colour))) names(celltype_colors) = as.character(celltype_colors_df$celltype)
Let's see which cell types (and in which quantities) we observe in mouse embryo.
# celltype composition tab = as.data.frame(table(sce_mouseEmbryo$celltype)) colnames(tab) = c("celltype" , "n") tab = tab[order(tab$n) , ] tab$celltype = factor(tab$celltype , levels = tab$celltype) p1 = ggplot(tab , aes(x = celltype , y = log2(n) , fill = celltype)) + geom_bar(stat = "identity" , position = "dodge") + scale_fill_manual(values = celltype_colors) + theme_classic() + theme(axis.text.x = element_blank()) + labs(y = "log2(# cells)", x = "Cell type") # UMAP p2 = ggplot(meta , aes(x = x , y = y , col = celltype)) + geom_point(size=1,alpha = .9) + scale_color_manual(values = celltype_colors) + theme_classic() + labs(x = "UMAP-1" , y = "UMAP-2") + ggtitle("UMAP") # combine p = ggarrange(p1,p2, common.legend = T) p
To select gene panel, use gene_search
.
The main arguments of gene_search
include:
genes_base
: string specifying pre-selected genes to be included in the panel. Default genes_base = NULL
meaning that no genes are selected. In this vignette we pre-computed first 5 genes, and we will plug them as an argument for genes_base
: genes_base = c("Hba-x", "Acta2", "Ttr", "Crabp1" , "Hoxaas3")
.
n_genes_total
: a scalar specifying number of genes to be selected.
batch
: character vector specifying batch (if known). Should be in colData of SingleCellExperiment object. Default batch = NULL
meaning no batches are specified. Here we use sample
as batch identification: batch = "sample
.
genes.discard
: string specifying pre-selected genes to be explicitly discarded from the panel. Default genes.discard = NULL
meaning that no genes are discarded.
genes.discard_prefix
: string specifying prefix for genes to be explicitly discarded from the panel (e.g. "Rpl" for large ribosomal subunit if no ribosomal genes are of interest). Default genes.discard_prefix = NULL
meaning that no genes are discarded.
n_genes_total = 10 genes_stat = gene_search(sce_mouseEmbryo , genes_base = c("Hba-x", "Acta2", "Ttr", "Crabp1" , "Hoxaas3"), n_genes_total = n_genes_total, batch = "sample", verbose = T) genes = genes_stat$gene
Cell type mapping for a single gene panel can be computed separately using get_celltype_mapping
.
We also included visualization function that returns heatmap for the confusion matrix: plot_mapping_heatmap
.
celltype_mapping = get_celltype_mapping(sce_mouseEmbryo , genes.selection = genes, batch = "sample", return.stat = F) p = plot_mapping_heatmap(celltype_mapping$mapping, title = "Cell type confusion matrix") p
Henceforth, to be less wordy, we refer to cell neighborhood preservation score as cell score; gene prediction score as gene score.
We estimate the quality of the panel using evaluate_library
. In this example, we will only compute cell score (see extended tutorials for more in-depth analysis).
stat = evaluate_library(sce_mouseEmbryo, genes, genes.all = rownames(sce_mouseEmbryo), batch = "sample", library.size_type = "single", celltype.id = "celltype", return.cell_score_stat = T, return.gene_score_stat = F, return.celltype_stat = F, verbose = FALSE)
Let's assess whether certain cell types are consistently 'under-preserved' by looking at the distribution of cell scores across cell types.
cell_score_stat = stat$cell_score_stat[stat$cell_score_stat$n_genes == n_genes_total , ] cell_score_stat = merge(cell_score_stat, meta) p = ggplot(cell_score_stat , aes(x = celltype , y = cell_score, fill = celltype)) + geom_boxplot() + scale_fill_manual(values = celltype_colors) + theme_classic() + labs(y = "Cell neighborhood preservation score" , x = "# genes") + theme(legend.position = "none") + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) p
We see that PGCs and blood progenitors in average have lower scores. This is consistent with observation from Fig. 2.
To note, it is not particularly surprising since PGC and blood progenitors are very rare cell types, and our panel size is only 10. Therefore we suggest that more genes should be included into the panel to successfully resolve these cell types.
Let's actually look into what genes we select.
Let's plot UMAPs, colored by expression of the selected genes, to have an intuitive grasp of what kind of genes we select. UMAPs can be easily plotted with provided function plot_umaps_w_counts
. Note that we require UMAP coordinates to be stored in reducedDim(sce) under the name 'UMAP', and colnames for the coordinates should be 'x' and 'y'.
If UMAPs are not provided, you can compute them using get_umap_coordinates
and then store it in sce.
p = plot_umaps_w_counts(sce_mouseEmbryo, genes) p
Overall we see that most of the genes are expressed in some transcriptional regions but not others (i.e. like cell type markers behavior).
We also select cell state marker - cell cycle gene Ube2c.
Let's compare UMAP plots when using whole transcriptome and using the selected panel (UMAP-coordinates can be calculated using get_umap_coordinates
).
For consistency, e will re-calculate UMAP coordinates for the whole transcriptome as well.
umaps_all = get_umap_coordinates(sce_mouseEmbryo, genes = rownames(sce_mouseEmbryo), batch = "sample", nPC = 50) colnames(umaps_all) = c("cell", "x_all", "y_all") umaps_selection = get_umap_coordinates(sce_mouseEmbryo, genes = genes, batch = "sample") colnames(umaps_selection) = c("cell", "x_selection", "y_selection") umaps_combined = merge(umaps_all, umaps_selection) meta = as.data.frame(colData(sce_mouseEmbryo)) meta = merge(meta, umaps_combined) # plot p1 = ggplot(meta , aes(x = x_all , y = y_all , col = celltype)) + geom_point(size=1,alpha = .9) + scale_color_manual(values = celltype_colors) + theme_classic() + labs(x = "UMAP-1" , y = "UMAP-2") + ggtitle("All genes") p2 = ggplot(meta , aes(x = x_selection , y = y_selection , col = celltype)) + geom_point(size=1,alpha = .9) + scale_color_manual(values = celltype_colors) + theme_classic() + labs(x = "UMAP-1" , y = "UMAP-2") + ggtitle("All genes") p = ggarrange(p1,p2,common.legend = T) p
Let's look a bit more systematically into how cell type specific are expressions of the selected genes.
p = plot_expression_heatmap(sce_mouseEmbryo, genes = genes, value.type = "mean") p
Also, let's look in overall co-expression between selected genes.
p = plot_coexpression(sce_mouseEmbryo, genes = genes) p
Finally, let's estimate the redundancy of the gene panel on gene/celltype level: for each gene, we temporarily remove it from the panel and compare the accuracy of cell type mappings with and without removed gene (per cell type).
The function to get this stat is get_redundancy_stat
. We also provide the function that will plot the results as a heatmap: plot_redundancy_stat
. Red corresponds to cases where cell type mapping accuracy drops with the removal of the gene, green - where cell type mapping accuracy increases with the removal of the gene (that can occur for cell type that were not well mapped to begin with).
redundancy_stat = get_redundancy_stat(sce_mouseEmbryo, genes, genes_to_assess = genes, batch = "sample") p = plot_redundancy_stat(redundancy_stat) p
From this analysis, we can conclude the particular relevance of some genes for some cell types e.g. Plvap for Endothelium, Acta2 for Cardiomyocytes, etc.
This line of analysis also allows to estimate overall redundancy of the panel which is useful when planning -FISH experiments.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.