knitr::opts_chunk$set( collapse = TRUE, warnings = FALSE, messages = FALSE, comment = "#>" )
suppressPackageStartupMessages({ library(magrittr) library(limma) library(org.Hs.eg.db) library(edgeR) library(matrixStats) library(pheatmap) library(cellassign) })
In many situations, marker genes for cell types are either known a priori
as expert knowledge, or can be curated through databases such as the Cellmark database. Alternatively,
if purified expression data exists (either in bulk or single-cell form), it
is possible to quickly derive marker genes using the findMarkers
function
in the scran
R package.
Below we detail a case study in deriving marker genes through a differential expression approach.
We take bulk RNA-seq data from Holik et al. Nucleic Acids Research 2017 to derive
marker genes for 3 different cell lines. This is packaged with cellassign
as holik_data
:
data(holik_data)
which contains a matrix of counts, where each row is a gene (index by entrez ID) and each column is a sample:
head(holik_data$counts[,1:2])
as well as a vector with the cell line of origin for each sample:
head(holik_data$cell_line)
We first provide a map from entrez IDs to gene symbols:
entrez_map <- select(org.Hs.eg.db, as.character(rownames(holik_data$counts)), c("SYMBOL"), "ENTREZID") gene_annotations <- entrez_map %>% dplyr::rename(GeneID=ENTREZID, Symbol=SYMBOL)
Then construct the DGEList
object for input to limma voom
,
filtering out lowly expressed genes:
dge <- DGEList(counts = holik_data$counts, group = holik_data$cell_line, genes = gene_annotations, remove.zeros = TRUE) genes_to_keep <- rowSums(cpm(dge$counts) > 0.5) >= 2 dge_filt <- dge[genes_to_keep,]
and finally calculate the normalization factors:
dge_filt <- calcNormFactors(dge_filt, method="TMM")
We next perform differential expression using Limma Voom on a subset of 3 samples: HCC827, H2228, H1975:
dge_subset <- dge_filt[,dge_filt$samples$group %in% c("HCC827", "H2228", "H1975")] design <- model.matrix(~ 0+dge_subset$samples$group) colnames(design) <- levels(dge_subset$samples$group) v <- voom(dge_subset, design) fit <- lmFit(v, design)
Next, fit contrasts to find differentially expressed genes between cell types:
contrast.matrix <- makeContrasts(H2228 - H1975, HCC827 - H1975, HCC827 - H2228, levels = design) fit2 <- contrasts.fit(fit, contrast.matrix) fit2 <- eBayes(fit2)
Finally, compute gene summary statistics and filter to only significantly differentially expressed geens (FDR < 0.05):
tt <- topTable(fit2, n=Inf) tt_sig <- tt %>% dplyr::filter(adj.P.Val < 0.05) head(tt_sig)
To derive marker genes, we first create a log fold change matrix using H1975 as the baseline expression:
lfc_table <- tt_sig[,c("H2228...H1975", "HCC827...H1975")] lfc_table <- lfc_table %>% dplyr::mutate(H1975=0, H2228=H2228...H1975, HCC827=HCC827...H1975) %>% dplyr::select(H1975, H2228, HCC827) rownames(lfc_table) <- tt_sig$GeneID
Then, for each gene, we subtract the minimum log fold change, as we care about overexpression of genes relative to some minimum expression level, as this defines a marker gene:
lfc_table <- as.matrix(lfc_table) lfc_table <- lfc_table - rowMins(lfc_table) lfc_table <- as.data.frame(lfc_table)
We now define a helper function for turning log fold changes into a binary matrix. This takes a matrix and a threshold, and any values less than or equal to the threshold are set to 0, and all others to 1:
binarize <- function(x, threshold) { x[x <= threshold] <- -Inf x[x > -Inf] <- 1 x[x == -Inf] <- 0 return(x) }
Next, we implement a basic procedure for binarizing this matrix. Essentially, we look for the largest 'gap' in expression for each gene, and the cell types with expression above this gap are designated has having that gene as a marker:
# Find the biggest difference maxdiffs <- apply(lfc_table, 1, function(x) max(diff(sort(x)))) # thres_vals <- apply(lfc_table, 1, function(x) sort(x)[which.max(diff(sort(x)))]) expr_mat_thres <- plyr::rbind.fill(lapply(1:nrow(lfc_table), function(i) { binarize(lfc_table[i,], thres_vals[i]) })) rownames(expr_mat_thres) <- rownames(lfc_table) marker_gene_mat <- expr_mat_thres[(maxdiffs >= quantile(maxdiffs, c(.99))) & (thres_vals <= log(2)),] %>% as.matrix
Finally, we add back in gene symbols rather than entrez ids:
suppressMessages({ symbols <- plyr::mapvalues( rownames(marker_gene_mat), from = gene_annotations$GeneID, to = gene_annotations$Symbol ) }) is_na <- is.na(symbols) marker_gene_mat <- marker_gene_mat[!is_na,] rownames(marker_gene_mat) <- symbols[!is_na]
And there we have a marker gene matrix for our cell types:
head(marker_gene_mat)
pheatmap(t(marker_gene_mat))
Note that the expression data used for input to CellAssign
should
use only these as input.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.