In WeiSong-bio/roryk-bcbioSinglecell: Single-Cell RNA-Seq Utilities

# Last modified 2018-06-05
bcbioSingleCell::prepareSingleCellTemplate()
source("_setup.R")

# Load SingleCellExperiment object
bcb_name <- load(params$bcb_file)
bcb <- get(bcb_name, inherits = FALSE)
stopifnot(is(bcb, "SingleCellExperiment"))
invisible(validObject(bcb))

# Temporarily set to `sampleName`
interesting_groups <- interestingGroups(bcb)
interestingGroups(bcb) <- "sampleName"

# getMethod("sampleData", "SingleCellExperiment")
sampleData(bcb, return = "data.frame")

[bcbio][] run data was imported from r metadata(bcb)$uploadDir.

Reads per cell {.tabset}

These are counts of how many reads are assigned to a given cellular barcode. It is normal for single cell RNA-seq data to contain a large number of low complexity barcodes. The bcbio pipeline filters out most of these barcodes, and here we have applied a threshold cutoff of a minimum of r metadata(bcb)$cellularBarcodeCutoff reads per cell. The unfiltered read count distributions are shown here.

ECDF

An empirical distribution function (ECDF) plot will show the frequency distribution of the reads per cell. You can see that the vast majority of low complexity barcodes plateau at a read depth below 1000 reads per cell.

# getMethod("plotReadsPerCell", "bcbioSingleCell")
plotReadsPerCell(bcb, geom = "ecdf")

Histogram

For high quality data, the proportional histogram should contain a single large peak that represents cells that were encapsulated. If we see a strong shoulder, or a bimodal distribution of the cells, that can indicate a couple problems. It might be that there is free floating RNA, which happens when cells are dying. It could also be that there are a set of cells that failed for some reason. Finally, it could also be that there are biologically different types of cells, and one type is much smaller than the other. If this is the case we would expect to see less RNA being sequenced from the smaller cells.

It looks like there a lot of low complexity barcodes that need to be filtered out, but we can see cells with a usable read depth of at least 10,000 (10^4) reads per cell.

# getMethod("plotReadsPerCell", "bcbioSingleCell")
plotReadsPerCell(bcb, geom = "histogram")

UMI counts per cell {.tabset}

Now let's assess the distribution of unique molecular identifier (UMI)-deconvoluted counts per cell. In general, the distributions should be relatively uniform per sample.

# getMethod("plotReadsPerCell", "bcbioSingleCell")
plotUMIsPerCell(bcb, geom = "histogram")

Now let's rank cellular barcodes based on their UMI count per cell. Note that the scale here is log10. Refer to the DropletUtils::barcodeRanks() documentation or the DropletUtils vignette for more information (see "Computing barcode ranks" section).

The knee and inflection points on the curve here help distinguish the UMI threshold between empty droplets with little RNA and cell-containing droplets with much more RNA. The knee point is more sensitive to noise, and we generally recommend starting with the inflection point for filtering because it is more conservative.

# getMethod("plotBarcodeRanks", "SingleCellExperiment")
plotBarcodeRanks(bcb)

Let's view the UMI knee point cutoffs per sample on a single ECDF plot.

# getMethod("plotUMIsPerCell", "SingleCellExperiment")
plotUMIsPerCell(bcb, geom = "ecdf", point = "knee")

Filter cells by UMI count

Let's apply this step first and then proceed to evaluating gene detection, mitocondrial transcript abundance, and novelty scores.

# getMethod("filterCells", "SingleCellExperiment")
bcb <- filterCells(bcb, minUMIs = params$min_umis)

Let's take a look at the UMI per cell distributions after this filtering step. Note that we haven't applied very strict filtering here — we're going to cut off the "low quality" cells based on the gene detection rate, novelty score, and mitochondrial abundance.

# getMethod("plotUMIsPerCell", "SingleCellExperiment")
plotUMIsPerCell(bcb, geom = "ecdf")
plotUMIsPerCell(bcb, geom = "histogram")

Genes detected per cell {.tabset}

Here by "detected", we mean genes with a non-zero count measurement per cell. Seeing gene detection in the range of 500-5000 is normal for most single-cell experiments.

# getMethod("plotGenesPerCell", "SingleCellExperiment")

markdownHeader("ECDF", level = 2)
plotGenesPerCell(
    object = bcb,
    geom = "ecdf",
    min = min(params$min_genes),
    max = max(params$max_genes)
)

markdownHeader("Histogram", level = 2)
plotGenesPerCell(
    object = bcb,
    geom = "histogram",
    min = min(params$min_genes),
    max = max(params$max_genes)
)

markdownHeader("Violin", level = 2)
plotGenesPerCell(
    object = bcb,
    geom = "violin",
    min = min(params$min_genes),
    max = max(params$max_genes)
)

UMIs vs. genes detected

If we graph out the total number of UMI counts per cell vs. the genes detected per cell, we can assess whether there is a large population of low quality cells with low counts and/or gene detection.

# getMethod("plotUMIsVsGenes", "SingleCellExperiment")
plotUMIsVsGenes(bcb)

Novelty score {.tabset}

Another way to QC the data is to look for less novelty, that is cells that have less genes detected per count than other cells. We can see the samples where we sequenced each cell less have a higher overall novelty, that is because we have not started saturated the sequencing for any given gene for these samples. Outlier cells in these samples might be cells that we have a less complex RNA species than other cells. Sometimes we can detect contamination with low complexity cell types like red blood cells via this metric.

# getMethod("plotNovelty", "SingleCellExperiment")

markdownHeader("ECDF", level = 2)
plotNovelty(
    object = bcb,
    geom = "ecdf",
    min = min(params$min_novelty)
)

markdownHeader("Histogram", level = 2)
plotNovelty(
    object = bcb,
    geom = "histogram",
    min = min(params$min_novelty)
)

markdownHeader("Violin", level = 2)
plotNovelty(
    object = bcb,
    geom = "violin",
    min = min(params$min_novelty)
)

markdownHeader("Ridgeline", level = 2)
plotNovelty(
    object = bcb,
    geom = "ridgeline",
    min = min(params$min_novelty)
)

Mitochondrial abundance {.tabset}

We evaluate overall mitochondrial gene expression as a biomarker of cellular stress during sample preparation.

# getMethod("plotMitoRatio", "SingleCellExperiment")

markdownHeader("ECDF", level = 2)
plotMitoRatio(
    object = bcb,
    geom = "ecdf",
    max = max(params$max_mito_ratio)
)

markdownHeader("Histogram", level = 2)
plotMitoRatio(
    object = bcb,
    geom = "histogram",
    max = max(params$max_mito_ratio)
)

markdownHeader("Violin", level = 2)
plotMitoRatio(
    object = bcb,
    geom = "violin",
    max = max(params$max_mito_ratio)
)

markdownHeader("Ridgeline", level = 2)
plotMitoRatio(
    object = bcb,
    geom = "ridgeline",
    max = max(params$max_mito_ratio)
)

Filter cells

# getMethod("filterCells", "SingleCellExperiment")
bcb <- filterCells(
    object = bcb,
    minGenes = params$min_genes,
    maxGenes = params$max_genes,
    maxMitoRatio = params$max_mito_ratio,
    minNovelty = params$min_novelty,
    minCellsPerGene = params$min_cells_per_gene
)

# getMethod("plotQC", "SingleCellExperiment")
plotQC(
    object = bcb,
    return = "markdown",
    headerLevel = 2
)

interestingGroups(bcb) <- interesting_groups
assignAndSaveData(
    name = paste(bcb_name, "filtered", sep = "_"),
    object = bcb,
    dir = params$data_dir
)