suppressPackageStartupMessages(library(magrittr))
if (rlang::is_true(getOption("knitr.in.progress"))) {
  params_ <- scdrake::scdrake_list(params)
}
drake_cache_dir <- params_$drake_cache_dir

drake::loadd(
  config_main, config_input_qc, empty_droplets, sce_valid_cells_info, barcode_ranks,
  qc_filter, custom_filter, sce_qc_filter_rowSums, sce_custom_filter_rowSums,
  path = drake_cache_dir
)

cfg <- config_input_qc
empty_droplets_enabled <- cfg$EMPTY_DROPLETS_ENABLED
cell_filtering_enabled <- cfg$ENABLE_CELL_FILTERING
gene_filtering_enabled <- cfg$ENABLE_GENE_FILTERING

input_type <- cfg$INPUT_DATA$type
filtering_type <- ifelse(cfg$SAVE_DATASET_SENSITIVE_FILTERING, "dataset-sensitive", "custom")



if (input_type == "cellranger") {
  scdrake::md_header("Input data: 10x Genomics Cell Ranger data", 1)
  cat(scdrake::str_space(
    "The feature-barcode matrix was imported from",
    "[Cell Ranger](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger)",
    "output (the official quantification tool from 10x Genomics)."
  ))
} else if (input_type == "table") {
  scdrake::md_header("Input data: delimited text (table)", 1)
  cat("The feature-barcode matrix was imported from a delimited file.")
} else if (input_type == "sce") {
  scdrake::md_header("Input data: `SingleCellExperiment` object", 1)
  cat("The object holding experimental data (feature-barcode matrix, gene annotation etc.) was imported from a Rds file.")
}

Each row of feature-barcode matrix corresponds to a gene, while each column corresponds to a cell barcode. Summary of imported data:

cat(drake::readd(sce_raw_info, path = drake_cache_dir)$str)

r scdrake::format_used_functions("DropletUtils::read10xCounts()")


Empty droplets

Empty droplets often contain RNA from the ambient solution, resulting in non-zero counts after debarcoding. It is desired to discard such droplets.

Barcode rank plot

A useful diagnostic for droplet-based data is the barcode rank plot, which shows the total UMI (log-)count for each barcode on the y-axis and the (log-)rank on the x-axis. This is effectively a transposed empirical cumulative density plot with log-transformed axes. It is useful as it allows examine the distribution of total UMI counts across barcodes, focusing on those with the largest counts.

uniq <- !duplicated(barcode_ranks$rank)
plot(barcode_ranks$rank[uniq], barcode_ranks$total[uniq], log = "xy", xlab = "Rank", ylab = "Total")
o <- order(barcode_ranks$rank)
lines(barcode_ranks$rank[o], barcode_ranks$fitted[o], col = "red")

abline(h = metadata(barcode_ranks)$knee, col = "dodgerblue", lty = 2)
abline(h = metadata(barcode_ranks)$inflection, col = "forestgreen", lty = 2)
if (empty_droplets_enabled) {
  abline(h = cfg$EMPTY_DROPLETS_LOWER, col = "firebrick", lty = 2)
  legend(
    "bottomleft",
    lty = 2,
    col = c("dodgerblue", "forestgreen", "firebrick"),
    legend = c("knee", "inflection", "emptyDroplets lower bound")
  )
} else {
  legend(
    "bottomleft",
    lty = 2,
    col = c("dodgerblue", "forestgreen"),
    legend = c("knee", "inflection")
  )
}

The knee and inflection points on the curve mark the transition between two components of the total UMI count distribution. This is assumed to represent the difference between empty droplets with little RNA and cell-containing droplets with much more RNA.

if (empty_droplets_enabled) {
  cat(
    "The emptyDroplets lower bound specifies at or below which number of the total UMI count all barcodes",
    "are assumed to correspond to empty droplets."
  )
} else {
  cat("Removal of empty droplets was disabled. You can enable it by setting `EMPTY_DROPLETS_ENABLED` parameter to `TRUE`.")
}

```r)}

***

# Gene + Cell quality filtering

## Pre-filtering QC

Given sets of mitochondrial and ribosomal genes in the data, the `scater` package automatically calculates
several per-cell QC metrics:

- Number of UMI.
- Number of detected genes (non-zero UMI count).
- Percentage of expressed mitochondrial genes ($\frac {UMI_{mitochondrial}} {UMI_{sum}} * 100$).

Then we can use two different methods to filter cells based on the metrics above:

- **Custom filtering**: a standard approach is to filter cells with low amount of reads as well as genes that are
  present in at least a certain amount of cells, using fixed thresholds. While simple, using fixed thresholds requires
  knowledge of the experiment and of the experimental protocol.
- **Dataset-sensitive filtering**: an alternative approach is to use adaptive, data-driven thresholds to identify
  outlying cells, based on the set of QC metrics just calculated. We identify cells that are outliers for the various
  QC metrics, based on the median absolute deviation (MAD) from the median value of each QC metric across all cells.
  Specifically, a value is considered an outlier if it is more than `r cfg$MAD_THRESHOLD` MADs from the median in
  the "problematic" direction.

Additionaly, extremely high number of detected genes could indicate doublets (more sensitive doublet detection is
done after library normalization). However, depending on the cell type composition in your sample,
you may have cells with higher number of genes (and also higher counts) from one cell type.

Now we can plot some of the QC features. Cells are colored by `discard_qc`, meaning if a cell would be discarded by
MAD thresholding on a QC metric.

```r
cowplot::plot_grid(plotlist = drake::readd(sce_unfiltered_plotlist, path = drake_cache_dir), ncol = 2)

r scdrake::format_used_functions("scuttle::perCellQCMetrics()")

Filtering {.tabset}

Dataset-sensitive filters

Cell filtering

```r)}

```r)}
cat("Cell filtering was disabled.")

Gene filtering

```r)}

```r)}
cat("Gene filtering was disabled.")

Custom filters

Cell filtering

```r)}

```r)}
cat("Cell filtering was disabled.")

Gene filtering

```r)}

```r)}
cat("Gene filtering was disabled.")

Post-filtering QC

Final filtering selection: using r filtering_type filtering.

cat(drake::readd(sce_final_input_qc_info, path = drake_cache_dir)$str)

Cell and gene number history

scdrake::render_bootstrap_table(drake::readd(sce_history, path = drake_cache_dir), full_width = FALSE, position = "left")
print(drake::readd(sce_history_plot, path = drake_cache_dir))

Dataset-sensitive filtering

Plots of QC metrics after dataset-sensitive filtering. discard_custom means if given cell was discarded in custom filtering.

cowplot::plot_grid(plotlist = drake::readd(sce_qc_filter_genes_plotlist, path = drake_cache_dir), ncol = 2)

Filtering based on custom filters

Plots of QC metrics after custom filtering. discard_qc means if given cell was discarded in dataset-sensitive filtering.

cowplot::plot_grid(plotlist = drake::readd(sce_custom_filter_genes_plotlist, path = drake_cache_dir), ncol = 2)

Gene annotation

drake::readd(gene_annotation, path = drake_cache_dir) %>%
  head() %>%
  scdrake::render_bootstrap_table()


Show input parameters

Main config

print(config_main)


Input QC config

print(cfg)





bioinfocz/scdrake documentation built on Sept. 19, 2024, 4:43 p.m.