In MarioniLab/geneBasisR: What the Package Does (One Line, Title Case)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(geneBasisR)
library(SingleCellExperiment)
library(tibble)
library(ggplot2)
library(ggpubr)

Introduction

geneBasis is an iterative greedy approach for gene selection. It strives a gene selection that will preserve transcriptional relationships between cells, represented as kNN-graphs.

Here we exemplify how geneBasis can be utilized for the design of targeted gene panels (just main steps; for the extended workflow, covering various aspects of gene selection process, please see tutorials here: https://github.com/MarioniLab/geneBasisR).

Load data

We will be working with mouse embryo, E8.5. Initial filtering for uninteresting genes was already performed.

data("sce_mouseEmbryo", package = "geneBasisR")

# Fetch meta-data and umap coordinates, merge together
meta = as.data.frame(colData(sce_mouseEmbryo))
umaps = as.data.frame(reducedDim(sce_mouseEmbryo, "UMAP"))
umaps = rownames_to_column(umaps, var = "cell")
meta = merge(meta, umaps)

# Fetch colour scheme for celltypes from meta-data
celltype_colors_df = unique(meta[, c("celltype" , "colour")])
celltype_colors = c(paste0("#",as.character(celltype_colors_df$colour)))
names(celltype_colors) = as.character(celltype_colors_df$celltype)

Cell type composition for mouse embryo

Let's see which cell types (and in which quantities) we observe in mouse embryo.

# celltype composition
tab = as.data.frame(table(sce_mouseEmbryo$celltype))
colnames(tab) = c("celltype" , "n")
tab = tab[order(tab$n) , ]
tab$celltype = factor(tab$celltype , levels = tab$celltype)
p1 = ggplot(tab , aes(x = celltype , y = log2(n) , fill = celltype)) +
  geom_bar(stat = "identity" , position = "dodge") +
  scale_fill_manual(values = celltype_colors) +
  theme_classic() +
  theme(axis.text.x = element_blank()) +
  labs(y = "log2(# cells)", x = "Cell type")

# UMAP
p2 = ggplot(meta , aes(x = x , y = y , col = celltype)) +
  geom_point(size=1,alpha = .9) +
  scale_color_manual(values = celltype_colors) +
  theme_classic() +
  labs(x = "UMAP-1" , y = "UMAP-2") +
  ggtitle("UMAP")

# combine
p = ggarrange(p1,p2, common.legend = T)
p

Select gene panel of size 10

To select gene panel, use gene_search.

The main arguments of gene_search include:

genes_base: string specifying pre-selected genes to be included in the panel. Default genes_base = NULL meaning that no genes are selected. In this vignette we pre-computed first 5 genes, and we will plug them as an argument for genes_base: genes_base = c("Hba-x", "Acta2", "Ttr", "Crabp1" , "Hoxaas3").
n_genes_total: a scalar specifying number of genes to be selected.
batch: character vector specifying batch (if known). Should be in colData of SingleCellExperiment object. Default batch = NULL meaning no batches are specified. Here we use sample as batch identification: batch = "sample.
genes.discard: string specifying pre-selected genes to be explicitly discarded from the panel. Default genes.discard = NULL meaning that no genes are discarded.
genes.discard_prefix: string specifying prefix for genes to be explicitly discarded from the panel (e.g. "Rpl" for large ribosomal subunit if no ribosomal genes are of interest). Default genes.discard_prefix = NULL meaning that no genes are discarded.

n_genes_total = 10
genes_stat = gene_search(sce_mouseEmbryo , genes_base = c("Hba-x", "Acta2", "Ttr", "Crabp1" , "Hoxaas3"), n_genes_total = n_genes_total, batch = "sample", verbose = T)
genes = genes_stat$gene

Evaluation of the gene panel

Cell type confusion matrix

Cell type mapping for a single gene panel can be computed separately using get_celltype_mapping.

We also included visualization function that returns heatmap for the confusion matrix: plot_mapping_heatmap.

celltype_mapping = get_celltype_mapping(sce_mouseEmbryo , genes.selection = genes, batch = "sample", return.stat = F)
p = plot_mapping_heatmap(celltype_mapping$mapping, title = "Cell type confusion matrix")
p

Cell score

Henceforth, to be less wordy, we refer to cell neighborhood preservation score as cell score; gene prediction score as gene score.

We estimate the quality of the panel using evaluate_library. In this example, we will only compute cell score (see extended tutorials for more in-depth analysis).

stat = evaluate_library(sce_mouseEmbryo, genes, genes.all = rownames(sce_mouseEmbryo), batch = "sample", 
                        library.size_type = "single", celltype.id = "celltype",
                        return.cell_score_stat = T, return.gene_score_stat = F, return.celltype_stat = F, verbose = FALSE)

Cell score - distribution per cell type

Let's assess whether certain cell types are consistently 'under-preserved' by looking at the distribution of cell scores across cell types.

cell_score_stat = stat$cell_score_stat[stat$cell_score_stat$n_genes == n_genes_total , ] 
cell_score_stat = merge(cell_score_stat, meta) 

p = ggplot(cell_score_stat , aes(x = celltype , y = cell_score, fill = celltype)) + 
  geom_boxplot() + 
  scale_fill_manual(values = celltype_colors) + 
  theme_classic() + 
  labs(y = "Cell neighborhood preservation score" , x = "# genes") + 
  theme(legend.position = "none") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) 
p

We see that PGCs and blood progenitors in average have lower scores. This is consistent with observation from Fig. 2.

To note, it is not particularly surprising since PGC and blood progenitors are very rare cell types, and our panel size is only 10. Therefore we suggest that more genes should be included into the panel to successfully resolve these cell types.

Characterisation of the selected genes

Let's actually look into what genes we select.

What are the genes we are selecting - visual inspection

Let's plot UMAPs, colored by expression of the selected genes, to have an intuitive grasp of what kind of genes we select. UMAPs can be easily plotted with provided function plot_umaps_w_counts. Note that we require UMAP coordinates to be stored in reducedDim(sce) under the name 'UMAP', and colnames for the coordinates should be 'x' and 'y'.

If UMAPs are not provided, you can compute them using get_umap_coordinates and then store it in sce.

p = plot_umaps_w_counts(sce_mouseEmbryo, genes)
p

Overall we see that most of the genes are expressed in some transcriptional regions but not others (i.e. like cell type markers behavior).

We also select cell state marker - cell cycle gene Ube2c.

UMAP-representation on how well data is preserved for the selected panel

Let's compare UMAP plots when using whole transcriptome and using the selected panel (UMAP-coordinates can be calculated using get_umap_coordinates).

For consistency, e will re-calculate UMAP coordinates for the whole transcriptome as well.

umaps_all = get_umap_coordinates(sce_mouseEmbryo, genes = rownames(sce_mouseEmbryo), batch = "sample", nPC = 50)
colnames(umaps_all) = c("cell", "x_all", "y_all")

umaps_selection = get_umap_coordinates(sce_mouseEmbryo, genes = genes, batch = "sample")
colnames(umaps_selection) = c("cell", "x_selection", "y_selection")

umaps_combined = merge(umaps_all, umaps_selection)

meta = as.data.frame(colData(sce_mouseEmbryo))
meta = merge(meta, umaps_combined)

# plot
p1 = ggplot(meta , aes(x = x_all , y = y_all , col = celltype)) +
  geom_point(size=1,alpha = .9) +
  scale_color_manual(values = celltype_colors) +
  theme_classic() +
  labs(x = "UMAP-1" , y = "UMAP-2") +
  ggtitle("All genes")
p2 = ggplot(meta , aes(x = x_selection , y = y_selection , col = celltype)) +
  geom_point(size=1,alpha = .9) +
  scale_color_manual(values = celltype_colors) +
  theme_classic() +
  labs(x = "UMAP-1" , y = "UMAP-2") +
  ggtitle("All genes")
p = ggarrange(p1,p2,common.legend = T)
p

Average expression across cell types and co-expression

Let's look a bit more systematically into how cell type specific are expressions of the selected genes.

p = plot_expression_heatmap(sce_mouseEmbryo, genes = genes, value.type = "mean") 
p

Also, let's look in overall co-expression between selected genes.

p = plot_coexpression(sce_mouseEmbryo, genes = genes) 
p

Estimation of redundancy

Finally, let's estimate the redundancy of the gene panel on gene/celltype level: for each gene, we temporarily remove it from the panel and compare the accuracy of cell type mappings with and without removed gene (per cell type).

The function to get this stat is get_redundancy_stat. We also provide the function that will plot the results as a heatmap: plot_redundancy_stat. Red corresponds to cases where cell type mapping accuracy drops with the removal of the gene, green - where cell type mapping accuracy increases with the removal of the gene (that can occur for cell type that were not well mapped to begin with).

redundancy_stat = get_redundancy_stat(sce_mouseEmbryo, genes, genes_to_assess = genes, batch = "sample") 
p = plot_redundancy_stat(redundancy_stat) 
p

From this analysis, we can conclude the particular relevance of some genes for some cell types e.g. Plvap for Endothelium, Acta2 for Cardiomyocytes, etc.

This line of analysis also allows to estimate overall redundancy of the panel which is useful when planning -FISH experiments.

Session Info

sessionInfo()

MarioniLab/geneBasisR documentation built on June 30, 2023, 2:04 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

MarioniLab/geneBasisR
What the Package Does (One Line, Title Case)

In MarioniLab/geneBasisR: What the Package Does (One Line, Title Case)

Introduction

Load data

Cell type composition for mouse embryo

Select gene panel of size 10

Evaluation of the gene panel

Cell type confusion matrix

Cell score

Cell score - distribution per cell type

Characterisation of the selected genes

What are the genes we are selecting - visual inspection

UMAP-representation on how well data is preserved for the selected panel

Average expression across cell types and co-expression

Estimation of redundancy

Session Info

R Package Documentation

Browse R Packages

We want your feedback!

MarioniLab/geneBasisR What the Package Does (One Line, Title Case)

In MarioniLab/geneBasisR: What the Package Does (One Line, Title Case)

Introduction

Load data

Cell type composition for mouse embryo

Select gene panel of size 10

Evaluation of the gene panel

Cell type confusion matrix

Cell score

Cell score - distribution per cell type

Characterisation of the selected genes

What are the genes we are selecting - visual inspection

UMAP-representation on how well data is preserved for the selected panel

Average expression across cell types and co-expression

Estimation of redundancy

Session Info

R Package Documentation

Browse R Packages

We want your feedback!

MarioniLab/geneBasisR
What the Package Does (One Line, Title Case)