knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Single-cell RNA-sequencing (scRNA-seq) technologies have revealed cell heterogeneity. Marker gene expressions are considered as the most intuitive and informative measurements distinguishing different cell types. Although several computational methods have been proposed to do marker gene selection from clusters or cell types, none of them applied marker gene information on bulk RNA-seq experiments to find cell type or cluster enrichment. Here, we present LRcell package, which uses Logistic/Linear Regression to identify the most transcriptionally enriched cell types or clusters when applying cell marker information to a bulk experiment with p-values measurements of differentially expressed genes.
Pre-Loaded Marker genes information: This package offers marker genes information in 1 human PBMC data, 1 human brain region and 9 mouse brain regions.
This is a R Bioconductor package and it can be installed by using BiocManager
.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") ## this will install the BiocManager package BiocManager::install("LRcell")
To check whether LRcell package is successfully installed:
library(LRcell)
Once we have LRcell package loaded, we can start using it to analyze the transcriptional engagement of cell types or clusters. LRcell takes both single-cell marker genes list and p-values of bulk DE genes as input to calculate the enrichment of cell-type specific marker genes in the ranked DE genes.
As mentioned above, LRcell provides single-cell marker genes list in 1 human PBMC data, 1 human brain region (Prefrontal Cortex) and 9 mouse brain regions (Frontal Cortex, Cerebellum, Globus Pallidus, Hippocampus, Entopeduncular, Posterior Cortex, Striatum, Substantia Nigra and Thalamus).
The data is stored in another Bioconductor ExperimentHub package named LRcellTypeMarkers. Users can access the data through ExperimentHub by:
## for installing ExperimentHub # BiocManager::install("ExperimentHub") ## query data library(ExperimentHub) eh <- ExperimentHub::ExperimentHub() eh <- AnnotationHub::query(eh, "LRcellTypeMarkers") ## query for LRcellTypeMarkers package eh ## this will list out EH number to access the calculated gene enrichment scores ## get mouse brain Frontal Cortex enriched genes enriched.g <- eh[["EH4548"]] marker.g <- get_markergenes(enriched.g, method="LR", topn=100)
Users are also encouraged to process a read count matrix with cell annotation information into a gene enrichment scores matrix.
enriched.g <- LRcell_gene_enriched_scores(expr, annot, power=1, parallel=TRUE, n.cores=4)
Here, expr
is a read count matrix with rows as genes and columns as cells. annot
is a named-vector with names as cell names (which is in accordance with the column names of expr
) and values as annotated cell types. power
is a hyper-parameter controlling how much penalty for the proportion of cells expressing certain gene. parallel
and n.cores
are two parameters for executing function in parallel to accelerate the calculation.
Compared to processing data yourself, a much easier way is to indicate species and brain region or tissue. In this way, marker genes are extracted from ExperimentHub accordingly. For example, we can use mouse Frontal Cortex marker genes to do LRcell analysis on the example bulk experiment[@swarup2019identification]. (The example contains 23, 420 genes along with p-values calculated from DESeq2. Data is processed from a mouse Alzheimer's disease model (GEO: GSE90693), which is 6 months after treatment in Frontal Cortex brain region.)
# load example bulk data data("example_gene_pvals") head(example_gene_pvals, n=5)
Here, we use Linear Regression:
res <- LRcell(gene.p = example_gene_pvals, marker.g = NULL, species = "mouse", region = "FC", method = "LiR") FC_res <- res$FC # exclude leading genes for a better view sub_FC_res <- subset(FC_res, select=-lead_genes) head(sub_FC_res)
Plot out the result:
plot_manhattan_enrich(FC_res, sig.cutoff = .05, label.topn = 5)
According to the result, when using enrichment scores as a predictor variable in Linear Regression, one cluster of Astrocytes (FC_8-2.Astrocytes) is the mostly enriched along with Microglia (FC_11-1.Microglia). (Note that although cluster FC_11-4 was annotated as unknown in the publication, according to our research, FC_11-* belong to Microglia/Macrophage cell types.) Recent publications have shown that Astrocytes are involved in Alzheimer's Disease.
Marker gene list downloading example (mouse, Frontal Cortex, Logistic Regression):
library(ExperimentHub) eh <- ExperimentHub::ExperimentHub() ## use ExperimentHub to download data eh <- query(eh, "LRcellTypeMarkers") enriched_genes <- eh[["EH4548"]] # use title ID which indicates FC region # get marker genes for LRcell in logistic regression FC_marker_genes <- get_markergenes(enriched_genes, method="LR", topn=100) # to have a glance of the marker gene list head(lapply(FC_marker_genes, head))
Then, we can run LRcell analysis by using LRcellCore()
function using Logistic Regression.
res <- LRcellCore(gene.p = example_gene_pvals, marker.g = FC_marker_genes, method = "LR", min.size = 5, sig.cutoff = 0.05) ## curate cell types res$cell_type <- unlist(lapply(strsplit(res$ID, '\\.'), '[', 2)) head(subset(res, select=-lead_genes))
Plot out the result:
plot_manhattan_enrich(res, sig.cutoff = .05, label.topn = 5)
We can clearly find that Microglia (FC_11-1.Microglia) is the most transcriptionally enriched when using Logistic Regression analysis. The result is clear and sound because proliferation and activation of Microglia in the brain is a known feature of Alzheimer's Disease (AD).
LRcell is used to give a hint on which cell types are potentially involved in diseases when a paired scRNA-seq and bulk RNA-seq experiments are unavailable. Thus, LRcell can be applied to the considerable amount of bulk DEG experiments to shed light on biological discoveries.
The gene enrichment score calculation is based on algorithm proposed in [@marques2016oligodendrocyte], which takes both enrichment of genes in certain cell types and fraction of cells expressing the gene into consideration. If interested, please take a look at the Supplementary Materials in the original paper.
LRcell_gene_enriched_scores()
function takes the gene-by-cell expression matrix and a cell-type annotation as input,
which means cell type assignments should be done first. The columns of the expression matrix
should be in accordance with cell names in the annotation vector.
# generate a simulated gene*cell read counts matrix n.row <- 3; n.col <- 10 sim.expr <- matrix(0, nrow=n.row, ncol=n.col) rownames(sim.expr) <- paste0("gene", 1:n.row) colnames(sim.expr) <- paste0("cell", 1:n.col) # generate a simulated annotation for cells sim.annot <- c(rep("celltype1", 3), rep("celltype2", 3), rep("celltype3", 4)) names(sim.annot) <- colnames(sim.expr) sim.expr['gene1', ] <- c(3, 0, 2, 8, 10, 6, 1, 0, 0, 2) # marker gene for celltype2 sim.expr['gene2', ] <- c(7, 5, 8, 1, 0, 5, 2, 3, 2, 1) # marker gene for celltype1 sim.expr['gene3', ] <- c(8, 10, 6, 7, 8, 9, 5, 8, 6, 8) # house keeping
Take a randomly-generated data as example.
# print out the generated expression matrix print(sim.expr) # print out the cell-type annotation print(sim.annot)
This toy example contains 3 genes and 10 cells. As you can tell from the matrix, gene1 is a marker gene of celltype2; gene2 is a marker gene of celltype1; gene3 is a house keeping gene.
# generating the enrichment score enriched_res <- LRcell_gene_enriched_scores(expr = sim.expr, annot = sim.annot, parallel = FALSE) enriched_res
According to the result, it is a gene-by-celltype dataframe indicating the enrichment score of genes in different cell type. Since gene2 is a marker gene of celltype1, thus it has the highest enrichment score.
When choosing marker genes using top 1 as threshold:
marker_res <- get_markergenes(enriched.g = enriched_res, method = "LR", topn=1) marker_res
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.