In Irrationone/cellassign: Automated, probabilistic assignment of scRNA-seq to cell types

knitr::opts_chunk$set(
  collapse = TRUE,
  warnings = FALSE,
  messages = FALSE,
  comment = "#>"
)

suppressPackageStartupMessages({
  library(magrittr)
  library(limma)
  library(org.Hs.eg.db)
  library(edgeR)
  library(matrixStats)
  library(pheatmap)
  library(cellassign)
})

Overview

In many situations, marker genes for cell types are either known a priori as expert knowledge, or can be curated through databases such as the Cellmark database. Alternatively, if purified expression data exists (either in bulk or single-cell form), it is possible to quickly derive marker genes using the findMarkers function in the scran R package.

Below we detail a case study in deriving marker genes through a differential expression approach.

Data

Overview

We take bulk RNA-seq data from Holik et al. Nucleic Acids Research 2017 to derive marker genes for 3 different cell lines. This is packaged with cellassign as holik_data:

data(holik_data)

which contains a matrix of counts, where each row is a gene (index by entrez ID) and each column is a sample:

head(holik_data$counts[,1:2])

as well as a vector with the cell line of origin for each sample:

head(holik_data$cell_line)

Preparation

We first provide a map from entrez IDs to gene symbols:

entrez_map <- select(org.Hs.eg.db, 
                     as.character(rownames(holik_data$counts)), 
                     c("SYMBOL"), "ENTREZID")
gene_annotations <- entrez_map %>%
  dplyr::rename(GeneID=ENTREZID,
                Symbol=SYMBOL)

Then construct the DGEList object for input to limma voom, filtering out lowly expressed genes:

dge <- DGEList(counts = holik_data$counts, 
               group = holik_data$cell_line, 
               genes = gene_annotations, 
               remove.zeros = TRUE)
genes_to_keep <- rowSums(cpm(dge$counts) > 0.5) >= 2
dge_filt <- dge[genes_to_keep,]

and finally calculate the normalization factors:

dge_filt <- calcNormFactors(dge_filt, method="TMM")

Differential expression

We next perform differential expression using Limma Voom on a subset of 3 samples: HCC827, H2228, H1975:

dge_subset <- dge_filt[,dge_filt$samples$group %in% c("HCC827", "H2228", "H1975")]
design <- model.matrix(~ 0+dge_subset$samples$group)
colnames(design) <- levels(dge_subset$samples$group)
v <- voom(dge_subset, design)
fit <- lmFit(v, design)

Next, fit contrasts to find differentially expressed genes between cell types:

contrast.matrix <- makeContrasts(H2228 - H1975, 
                                 HCC827 - H1975, 
                                 HCC827 - H2228, 
                                 levels = design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)

Finally, compute gene summary statistics and filter to only significantly differentially expressed geens (FDR < 0.05):

tt <- topTable(fit2, n=Inf)
tt_sig <- tt %>%
  dplyr::filter(adj.P.Val < 0.05)

head(tt_sig)

Marker gene derivation

To derive marker genes, we first create a log fold change matrix using H1975 as the baseline expression:

lfc_table <- tt_sig[,c("H2228...H1975", "HCC827...H1975")]
lfc_table <- lfc_table %>%
  dplyr::mutate(H1975=0,
                H2228=H2228...H1975,
                HCC827=HCC827...H1975) %>%
  dplyr::select(H1975, H2228, HCC827)
rownames(lfc_table) <- tt_sig$GeneID

Then, for each gene, we subtract the minimum log fold change, as we care about overexpression of genes relative to some minimum expression level, as this defines a marker gene:

lfc_table <- as.matrix(lfc_table)
lfc_table <- lfc_table - rowMins(lfc_table)
lfc_table <- as.data.frame(lfc_table)

We now define a helper function for turning log fold changes into a binary matrix. This takes a matrix and a threshold, and any values less than or equal to the threshold are set to 0, and all others to 1:

binarize <- function(x, threshold) {
  x[x <= threshold] <- -Inf
  x[x > -Inf] <- 1
  x[x == -Inf] <- 0
  return(x)
}

Next, we implement a basic procedure for binarizing this matrix. Essentially, we look for the largest 'gap' in expression for each gene, and the cell types with expression above this gap are designated has having that gene as a marker:

# Find the biggest difference
maxdiffs <- apply(lfc_table, 1, function(x) max(diff(sort(x))))

#
thres_vals <- apply(lfc_table, 1, function(x) sort(x)[which.max(diff(sort(x)))])
expr_mat_thres <- plyr::rbind.fill(lapply(1:nrow(lfc_table), function(i) {
  binarize(lfc_table[i,], thres_vals[i])
}))
rownames(expr_mat_thres) <- rownames(lfc_table)
marker_gene_mat <- expr_mat_thres[(maxdiffs >= quantile(maxdiffs, c(.99))) 
                                  & (thres_vals <= log(2)),] %>%
  as.matrix

Finally, we add back in gene symbols rather than entrez ids:

suppressMessages({
  symbols <- plyr::mapvalues(
    rownames(marker_gene_mat),
    from = gene_annotations$GeneID,
    to = gene_annotations$Symbol
  )
})

is_na <- is.na(symbols)

marker_gene_mat <- marker_gene_mat[!is_na,]
rownames(marker_gene_mat) <- symbols[!is_na]

And there we have a marker gene matrix for our cell types:

head(marker_gene_mat)

pheatmap(t(marker_gene_mat))

Note that the expression data used for input to CellAssign should use only these as input.

Technical

sessionInfo()

Irrationone/cellassign documentation built on April 23, 2020, 3:10 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Irrationone/cellassign
Automated, probabilistic assignment of scRNA-seq to cell types

In Irrationone/cellassign: Automated, probabilistic assignment of scRNA-seq to cell types

Overview

Data

Overview

Preparation

Differential expression

Marker gene derivation

Technical

R Package Documentation

Browse R Packages

We want your feedback!

Irrationone/cellassign Automated, probabilistic assignment of scRNA-seq to cell types

In Irrationone/cellassign: Automated, probabilistic assignment of scRNA-seq to cell types

Overview

Data

Overview

Preparation

Differential expression

Marker gene derivation

Technical

R Package Documentation

Browse R Packages

We want your feedback!

Irrationone/cellassign
Automated, probabilistic assignment of scRNA-seq to cell types