knitr::opts_chunk$set(echo = TRUE, warning = FALSE)
Performing comprehensive quality control (QC) is necessary to remove poor quality cells for downstream analysis of single-cell RNA sequencing (scRNA-seq) data. Therefore, assessment of the data is required, for which various QC algorithms have been developed. In singleCellTK (SCTK), we have written convenience functions for several of these tools. In this guide, we will demonstrate how to use these functions to perform quality control on cell data. (For definition of cell data, please refer this documentation.)
The package can be loaded using the library
command.
library(singleCellTK) library(dplyr)
We will use a filtered form of the PBMC 3K and 6K dataset from the package TENxPBMCData, which is available from the importExampleData()
function. We will combine these datasets together into a single SingleCellExperiment object.
pbmc3k <- importExampleData(dataset = "pbmc3k") pbmc6k <- importExampleData(dataset = "pbmc6k") pbmc.combined <- BiocGenerics::cbind(pbmc3k, pbmc6k) sample.vector = colData(pbmc.combined)$sample
SCTK also supports the importing of single-cell data from the following platforms: 10X CellRanger, STARSolo, BUSTools, SEQC, DropEST, Alevin, as well as dataset already stored in SingleCellExperiment object and AnnData object. To load your own input data, please refer Function Reference for pre-processing tools
under Console Analysis
section in Import data into SCTK for detailed instruction.
SCTK utilizes 2D embedding techniques such as TSNE and UMAP for visualizing single-cell data. Users can modify the dimensions by adjusting the parameters within the function. The logNorm
parameter should be set to TRUE
for normalization prior to generating the 2D embedding.
The sample
parameter may be specified if multiple samples exist in the SingleCellExperiment
object. Here, we will use the sample vector stored in the colData
slot of the SingleCellExperiment
object.
pbmc.combined <- runQuickUMAP(pbmc.combined)
All of the droplet-based QC algorithms are able to be run under the wrapper function runCellQC()
. By default all possible QC algorithms will be run.
Users may set a sample
parameter if need to compare between multiple samples. Here, we will use the sample vector stored in the SingleCellExperiment
object.
If users wishes, a list of gene sets can be applied to the function to determine the expression of a set of specific genes. A gene list imported into the SingleCellExperiment
object using importGeneSets*
functions can be set as collectionName
. Additionally, a pre-made list of genes can be used to determine the level of gene expression per cell. A list containing gene identifiers may be set as geneSetList
, or the user may instead use the geneSetCollection
parameter to supply a GeneSetCollection
object from the GSEABase package. Please also refer to Import Genesets documentation.
pbmc.combined <- importGeneSetsFromGMT(inSCE = pbmc.combined, collectionName = "mito", file = system.file("extdata/mito_subset.gmt", package = "singleCellTK")) set.seed(12345) pbmc.combined <- runCellQC(pbmc.combined, algorithms = c("QCMetrics", "scrublet", "scDblFinder", "cxds", "bcds", "cxds_bcds_hybrid", "doubletFinder", "decontX", "soupX"), sample = sample.vector, collectionName = "mito")
Users can also specify mitoRef
, mitoIDType
and mitoGeneLocation
arguments in runCellQC
function to quantify mitochondrial gene expression without the need to import gene sets. For the details about these arguments, please refer to runCellQC
and runPerCellQC()
.
pbmc.combined <- runCellQC(pbmc.combined, algorithms = c("QCMetrics", "scrublet", "scDblFinder", "cxds", "bcds", "cxds_bcds_hybrid", "doubletFinder", "decontX", "soupX"), sample = sample.vector, mitoRef = "human", mitoIDType = "symbol", mitoGeneLocation = "rownames")
If users choose to only run a specific set of algorithms, they can specify which to run with the algorithms
parameter. By default, the runCellQC()
will run "QCMetrics"
, "scDblFinder"
, "cxds"
, "bcds"
, "cxds_bcds_hybrid"
, "decontX"
and "soupX"
algorithms by default. Besides, "scrublet"
and "doubletFinder"
are supported if our users want to run them.
After running QC functions with SCTK, the output will be stored in the colData
slot of the SingleCellExperiment
object.
head(colData(pbmc.combined), 5)
df.matrix <- head(colData(pbmc.combined), 2) df.matrix %>% knitr::kable(format = "html") %>% kableExtra::kable_styling() %>% kableExtra::scroll_box(width = "80%")
````{=html}
A summary of all outputs
```r tabl <- " | QC output | Description | Methods | Package/Tool | |----------------------------------------|:--------------------------------------------------------------|:------------------|:--------------| | `sum` | Total counts | `runPerCellQC()` | [scater](https://bioconductor.org/packages/release/bioc/html/scater.html) | | `detected` | Total features | `runPerCellQC()` | [scater](https://bioconductor.org/packages/release/bioc/html/scater.html) | | `percent_top` | % Expression coming from top features | `runPerCellQC()` | [scater](https://bioconductor.org/packages/release/bioc/html/scater.html) | | `subsets_*` | sum, detected, percent_top calculated on specified gene list | `runPerCellQC()` | [scater](https://bioconductor.org/packages/release/bioc/html/scater.html) | | `scrublet_score` | Doublet score | `runScrublet()` | [scrublet](https://github.com/swolock/scrublet) | | `scrublet_call` | Doublet classification based on threshold | `runScrublet()` | [scrublet](https://github.com/swolock/scrublet) | | `scDblFinder_doublet_score` | Doublet score | `runScDblFinder()` | [scDblFinder](https://bioconductor.org/packages/release/bioc/html/scDblFinder.html) | | `doubletFinder_doublet_score` | Doublet score | `runDoubletFinder()` | [DoubletFinder](https://github.com/chris-mcginnis-ucsf/DoubletFinder) | | `doubletFinder_doublet_label_resolution` | Doublet classification based on threshold | `runDoubletFinder()` | [DoubletFinder](https://github.com/chris-mcginnis-ucsf/DoubletFinder) | | `scds_cxds_score` | Doublet score | `runCxds()` | [SCDS](https://www.bioconductor.org/packages/release/bioc/html/scds.html) | | `scds_cxds_call` | Doublet classification based on threshold | `runCxds()` | [SCDS](https://www.bioconductor.org/packages/release/bioc/html/scds.html) | | `scds_bcds_score` | Doublet score | `runBcds()` | [SCDS](https://www.bioconductor.org/packages/release/bioc/html/scds.html) | | `scds_bcds_call` | Doublet classification based on threshold | `runBcds()` | [SCDS](https://www.bioconductor.org/packages/release/bioc/html/scds.html) | | `scds_hybrid_score` | Doublet score | `runCxdsBcdsHybrid()` | [SCDS](https://www.bioconductor.org/packages/release/bioc/html/scds.html) | | `scds_hybrid_call` | Doublet classification based on threshold | `runCxdsBcdsHybrid()` | [SCDS](https://www.bioconductor.org/packages/release/bioc/html/scds.html) | | `decontX_contamination` | Ambient RNA contamination | `runDecontX()` | [celda](https://www.bioconductor.org/packages/release/bioc/html/celda.html) | | `decontX_clusters` | Clusters determined in dataset based on underlying algorithm | `runDecontX()` | [celda](https://www.bioconductor.org/packages/release/bioc/html/celda.html) | | `soupX_nUMIs` | Total number of UMI per cell | `runSoupX()` | [SoupX](https://github.com/constantAmateur/SoupX) | | `soupX_clusters` | Quick clustering label if clustering not provided by users | `runSoupX()` | [scran](https://rdrr.io/bioc/scran/man/quickCluster.html) | | `soupX_contamination` | Ambient RNA contamination | `runSoupX()` | [SoupX](https://github.com/constantAmateur/SoupX) | " cat(tabl) ``` ````{=html} </details>
The names of the 2D embedding and dimension reduction matrices are stored in the reducedDims
slot of the SingleCellExperiment
object.
reducedDims(pbmc.combined)
The function sampleSummaryStats()
may be used to generate a table containing the mean and median of the data per sample, which is stored within the qc_table
table under metadata
. The table can then be returned using getSampleSummaryStatsTable
.
pbmc.combined <- sampleSummaryStats(pbmc.combined, sample = sample.vector) getSampleSummaryStatsTable(pbmc.combined, statsName = "qc_table")
If users choose to generate a table for all QC metrics generated through runCellQC()
, they may set the simple
parameter to FALSE
.
pbmc.combined <- sampleSummaryStats(pbmc.combined, sample = sample.vector, simple = FALSE) getSampleSummaryStatsTable(pbmc.combined, statsName = "qc_table")
Instead of running all quality control methods on the dataset at once, users may elect to execute QC methods individually. The parameters as well as the outputs to individual QC functions are described in detail as follows:
````{=html}
General QC metrics
runPerCellQC
SingleCellTK utilizes the [scater](https://bioconductor.org/packages/release/bioc/html/scater.html) package to compute cell-level QC metrics. The wrapper function `runPerCellQC()` can be used to separately compute general QC metrics on its own. - `inSCE` parameter is the input SingleCellExperiment object. - `useAssay` is the assay object that in the SingleCellExperiment object the user wishes to use. A list of gene sets can be applied to the function to determine the expression of a set of specific genes, as mentioned before. Please also refer to [*Import Genesets* documentation](import_genesets.html). The QC outputs are `sum`, `detected`, and `percent_top_X`, stored as variables in `colData`. - `sum` contains the total number of counts for each cell. - `detected` contains the total number of features for each cell. - `percent_top_X` contains the percentage of the total counts that is made up by the expression of the top X genes for each cell. - The `subsets_` columns contain information for the specific gene list that was used. For instance, if a gene list containing ribosome genes named `"ribosome"` was used, `subsets_ribosome_sum` would contain the total number of ribosome gene counts for each cell. - `mito_sum`, `mito_detected` and `mito_percent` contains number of counts, number of mito features and percentage of mito gene expression of each cells. These columns will show up only if you specify arguments related to mito genes quantification in `runCellQC` function. Please refer to `runCellQC` and `runPerCellQC` documentation for more details. ```r pbmc.combined <- runPerCellQC( inSCE = pbmc.combined, useAssay = "counts", collectionName = "ribosome", mitoRef = "human", mitoIDType = "symbol", mitoGeneLocation = "rownames") ``` ````{=html} </details> </details> <details> <summary><b>Doublet Detection</b></summary>
Doublets hinder cell-type identification by appearing as a distinct transcriptomic state, and need to be removed for downstream analysis. SCTK contains various doublet detection tools that the user may choose from.
````{=html}
runScrublet
[Scrublet](https://github.com/swolock/scrublet) aims to detect doublets by creating simulated doublets from combining transcriptomic profiles of existing cells in the dataset. The wrapper function `runScrublet()` can be used to separately run the Scrublet algorithm on its own. - `sample` indicates what sample each cell originates from. It can be set to `NULL` if all cells in the dataset came from the same sample. Scrublet also has a large set of parameters that the user can adjust, please see the function reference for detail, by clicking on the function name. The Scrublet outputs include the following `colData` variables: - `scrublet_score`, which is a numeric variable of the likelihood that a cell is a doublet - `scrublet_call`, which is the assignment of whether the cell is a doublet. ```r pbmc.combined <- runScrublet( inSCE = pbmc.combined, sample = colData(pbmc.combined)$sample, useAssay = "counts" ) ``` ````{=html} </details> <details> <summary><b>runScDblFinder</b></summary>
ScDblFinder is a doublet detection algorithm. ScDblFinder aims to detect doublets by creating a simulated doublet from existing cells and projecting it to the same PCA space as the cells. The wrapper function runScDblFinder()
can be used to separately run the ScDblFinder algorithm on its own.
nNeighbors
is the number of nearest neighbor used to calculate the density for doublet detection. simDoublets
is used to determine the number of simulated doublets used for doublet detection.The output of ScDblFinder is a scDblFinder_doublet_score
, which will be stored as a colData
variable. The doublet score of a droplet will be higher if the it is deemed likely to be a doublet.
pbmc.combined <- runScDblFinder(inSCE = pbmc.combined, sample = colData(pbmc.combined)$sample, useAssay = "counts")
````{=html}
runDoubletFinder
[DoubletFinder](https://github.com/chris-mcginnis-ucsf/DoubletFinder) is a doublet detection algorithm which depends on the single cell analysis package [Seurat](https://satijalab.org/seurat/). The wrapper function `runDoubletFinder()` can be used to separately run the DoubletFinder algorithm on its own. - `seuratRes` - `runDoubletFinder()` relies on a parameter (in Seurat) called "resolution" to determine cells that may be doublets. Users will be able to manipulate the resolution parameter through `seuratRes`. If multiple numeric vectors are stored in `seuratRes`, there will be multiple label/scores. - `seuratNfeatures` determines the number of features that is used in the [`FindVariableFeatures`](https://satijalab.org/seurat/reference/findvariablefeatures) function in Seurat. - `seuratPcs` determines the number of dimensions used in the [`FindNeighbors`](https://satijalab.org/seurat/reference/findneighbors) function in Seurat. - `formationRate` is the estimated doublet detection rate in the dataset. It aims to detect doublets by creating simulated doublets from combining transcriptomic profiles of existing cells in the dataset. The DoubletFinder outputs include the following `colData` variable: - `doubletFinder_doublet_score`, which is a numeric variable of the likelihood that a cell is a doublet - `doubletFinder_doublet_label`, which is the assignment of whether the cell is a doublet. ```r pbmc.combined <- runDoubletFinder( inSCE = pbmc.combined, useAssay = "counts", sample = colData(pbmc.combined)$sample, seuratRes = c(1.0), seuratPcs = 1:15, seuratNfeatures = 2000, formationRate = 0.075, seed = 12345 ) ``` ````{=html} </details> <details> <summary><b>runCXDS</b></summary>
CXDS, or co-expression based doublet scoring, is an algorithm in the SCDS package which employs a binomial model for the co-expression of pairs of genes to determine doublets. The wrapper function runCxds()
can be used to separately run the CXDS algorithm on its own.
ntop
is the number of top variance genes to consider.binThresh
is the minimum counts a gene needs to have to be included in the analysis. verb
determines whether progress messages will be displayed or not. retRes
will determine whether the gene pair results should be returned or not. estNdbl
is the user estimated number of doublets.The output of runCxds()
is the doublet score, scds_cxds_score
, which will be stored as a colData
variable.
pbmc.combined <- runCxds( inSCE = pbmc.combined, sample = colData(pbmc.combined)$sample, ntop = 500, binThresh = 0, verb = FALSE, retRes = FALSE, estNdbl = FALSE )
````{=html}
runBCDS
BCDS, or binary classification based doublet scoring, is an algorithm in the [SCDS](https://www.bioconductor.org/packages/release/bioc/html/scds.html) package which uses a binary classification approach to determine doublets. The wrapper function `runBcds()` can be used to separately run the BCDS algorithm on its own. - `ntop` is the number of top variance genes to consider. - `srat` is the ratio between original number of cells and simulated doublets. - `nmax` is the maximum number of cycles that the algorithm should run through. If set to `"tune"`, this will be automatic. - `varImp` determines if the variable importance should be returned or not. The output of `runBcds()` is `scds_bcds_score`, which is the likelihood that a cell is a doublet and will be stored as a `colData` variable. ```r pbmc.combined <- runBcds( inSCE = pbmc.combined, seed = 12345, sample = colData(pbmc.combined)$sample, ntop = 500, srat = 1, nmax = "tune", varImp = FALSE ) ``` ````{=html} </details> <details> <summary><b>runCxdsBcdsHybrid</b></summary>
The CXDS-BCDS hybrid algorithm, uses both CXDS and BCDS algorithms from the SCDS package. The wrapper function runCxdsBcdsHybrid()
can be used to separately run the CXDS-BCDS hybrid algorithm on its own.
All parameters from the runCxds()
and runBcds()
functions may be applied to this function in the cxdsArgs
and bcdsArgs
parameters, respectively.
The output of runCxdsBcdsHybrid()
is the doublet score, scds_hybrid_score
, which will be stored as a colData
variable.
pbmc.combined <- runCxdsBcdsHybrid( inSCE = pbmc.combined, sample = colData(pbmc.combined)$sample, seed = 12345, nTop = 500 )
````{=html}
Ambient RNA Detection
runDecontX
In droplet-based single cell technologies, ambient RNA that may have been released from apoptotic or damaged cells may get incorporated into another droplet, and can lead to contamination. [decontX](https://rdrr.io/bioc/celda/man/decontX.html), available from [celda](https://bioconductor.org/packages/release/bioc/html/celda.html), is a Bayesian method for the identification of the contamination level at a cellular level. The wrapper function `runDecontX()` can be used to separately run the DecontX algorithm on its own. The outputs of `runDecontX()` are `decontX_contamination` and `decontX_clusters`. - `decontX_contamination` is a numeric vector which characterizes the level of contamination in each cell. - Clustering is performed as part of the `runDecontX()` algorithm. `decontX_clusters` is the resulting cluster assignment, which can also be labeled on the plot. For performing fine-tuned clustering in SCTK, please refer to [*Clustering* documentation](clustering.html) ```r pbmc.combined <- runDecontX( inSCE = pbmc.combined, useAssay = "counts" sample = colData(pbmc.combined)$sample ) ``` ````{=html} </details> <details> <summary><b>runSoupX</b></summary>
In droplet-based single cell technologies, ambient RNA that may have been released from apoptotic or damaged cells may get incorporated into another droplet, and can lead to contamination. SoupX uses non-expressed genes to estimates a global contamination fraction. The wrapper function runSoupX()
can be used to separately run the SoupX algorithm on its own.
he main outputs of runSoupX
are soupX_contamination
, soupX_clusters
, and the corrected assay SoupX
, together with other intermediate metrics that SoupX generates.
soupX_contamination
is a numeric vector which characterizes the level of contamination in each cell. SoupX generates one global contamination estimate per sample, instead of returning cell-specific estimation.quickCluster()
method from package scran is adopted for this purpose. soupX_clusters
is the resulting cluster assignment, which can also be labeled on the plot. For performing fine-tuned clustering in SCTK, please refer to Clustering documentationpbmc.combined <- runSoupX( inSCE = pbmc.combined, useAssay = "counts" sample = colData(pbmc.combined)$sample )
````{=html}
### Plotting QC metrics Upon running `runCellQC()` or any individual QC methods, the QC outputs will need to be plotted. For each QC method, SCTK provides a specialized plotting function. ````{=html} <details> <summary><b>General QC metrics</b></summary> <details> <summary><b>runPerCellQC</b></summary>
The wrapper function plotRunPerCellQCResults()
can be used to plot the general QC outputs.
runpercellqc.results <- plotRunPerCellQCResults(inSCE = pbmc.combined, sample = sample.vector, combinePlot = "all", axisSize = 8, axisLabelSize = 9, titleSize = 20, labelSamples=TRUE)
runpercellqc.results
````{=html}
````{=html} <details> <summary><b>Doublet Detection</b></summary> <details> <summary><b>Scrublet</b></summary>
The wrapper function plotScrubletResults()
can be used to plot the results from the Scrublet algorithm. Here, we will use the UMAP coordinates generated from runQuickUMAP()
in previous sections.
reducedDims(pbmc.combined)
scrublet.results <- plotScrubletResults( inSCE = pbmc.combined, reducedDimName = "UMAP", sample = colData(pbmc.combined)$sample, combinePlot = "all", titleSize = 10, axisLabelSize = 8, axisSize = 10, legendSize = 10, legendTitleSize = 10 )
scrublet.results
````{=html}
ScDblFinder
The wrapper function `plotScDblFinderResults()` can be used to plot the QC outputs from the ScDblFinder algorithm. ```r scDblFinder.results <- plotScDblFinderResults( inSCE = pbmc.combined, sample = colData(pbmc.combined)$sample, reducedDimName = "UMAP", combinePlot = "all", titleSize = 13, axisLabelSize = 13, axisSize = 13, legendSize = 13, legendTitleSize = 13 ) ``` <br /> ````{=html} </details> <details> <summary><b>DoubletFinder</b></summary>
The wrapper function plotDoubletFinderResults()
can be used to plot the QC outputs from the DoubletFinder algorithm.
doubletFinderResults <- plotDoubletFinderResults( inSCE = pbmc.combined, sample = colData(pbmc.combined)$sample, reducedDimName = "UMAP", combinePlot = "all", titleSize = 13, axisLabelSize = 13, axisSize = 13, legendSize = 13, legendTitleSize = 13 )
````{=html}
SCDS, CXDS
The wrapper function `plotCxdsResults()` can be used to plot the QC outputs from the CXDS algorithm. ```r cxdsResults <- plotCxdsResults( inSCE = pbmc.combined, sample = colData(pbmc.combined)$sample, reducedDimName = "UMAP", combinePlot = "all", titleSize = 13, axisLabelSize = 13, axisSize = 13, legendSize = 13, legendTitleSize = 13 ) ``` <br /> ````{=html} </details> <details> <summary><b>SCDS, BCDS</b></summary>
The wrapper function plotBcdsResults()
can be used to plot the QC outputs from the BCDS algorithm
bcdsResults <- plotBcdsResults( inSCE = pbmc.combined, sample = colData(pbmc.combined)$sample, reducedDimName = "UMAP", combinePlot = "all", titleSize = 13, axisLabelSize = 13, axisSize = 13, legendSize = 13, legendTitleSize = 13 )
````{=html}
SCDS, CXDS-BCDS hybrid
The wrapper function `plotScdsHybridResults()` can be used to plot the QC outputs from the CXDS-BCDS hybrid algorithm. ```r bcdsCxdsHybridResults <- plotScdsHybridResults( inSCE = pbmc.combined, sample = colData(pbmc.combined)$sample, reducedDimName = "UMAP", combinePlot = "all", titleSize = 13, axisLabelSize = 13, axisSize = 13, legendSize = 13, legendTitleSize = 13 ) ``` <br /> ````{=html} </details>
````{=html}
````{=html} <details> <summary><b>Ambient RNA Detection</b></summary> <details> <summary><b>DecontX</b></summary>
The wrapper function plotDecontXResults()
can be used to plot the QC outputs from the DecontX algorithm.
decontxResults <- plotDecontXResults( inSCE = pbmc.combined, sample = colData(pbmc.combined)$sample, reducedDimName = "UMAP", combinePlot = "all", titleSize = 8, axisLabelSize = 8, axisSize = 10, legendSize = 5, legendTitleSize = 7, relWidths = c(0.5, 1, 1), sampleRelWidths = c(0.5, 1, 1), labelSamples = TRUE, labelClusters = FALSE )
decontxResults
````{=html}
SoupX
The wrapper function `plotSoupXResults()` can be used to plot the QC outputs from the SoupX algorithm. ```r soupxResults <- plotSoupXResults( inSCE = pbmc.combined, sample = colData(pbmc.combined)$sample, reducedDimName = "UMAP", combinePlot = "all", titleSize = 8, axisLabelSize = 8, axisSize = 10, legendSize = 5, legendTitleSize = 7, labelClusters = FALSE ) ``` ```r soupxResults ``` ````{=html} </details </details>
SingleCellExperiment
objects can be subset by its colData
using subsetSCECols()
. The colData
parameter takes in a character vector of expression(s) which will be used to identify a subset of cells using variables found in the colData
of the SingleCellExperiment
object. For example, if x
is a numeric vector in colData
, then setting colData = "x < 5"
will return a SingleCellExperiment
object where all columns (cells) meet the condition that x
is less than 5. The index
parameter takes in a numeric vector of indices which should be kept, while bool
takes in a logical vector of TRUE
or FALSE
which should be of the same length as the number of columns (cells) in the SingleCellExperiment
object. Please refer to our Filtering documentation for detail.
#Before filtering: dim(pbmc.combined)
Remove barcodes with high mitochondrial gene expression:
pbmc.combined <- subsetSCECols(pbmc.combined, colData = 'mito_percent < 20')
Remove detected doublets from Scrublet:
pbmc.combined <- subsetSCECols(pbmc.combined, colData = 'scrublet_call == "Singlet"')
Remove cells with high levels of ambient RNA contamination:
pbmc.combined <- subsetSCECols(pbmc.combined, colData = 'decontX_contamination < 0.5')
#After filtering: dim(pbmc.combined)
For performing QC on droplet-level raw count matrix with SCTK, please refer to our Droplet QC documentation.
````{=html}
Session Information
```r sessionInfo() ``` ````{=html} </details>
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.