`{scdrake}` offers two pipelines - one for single samples, and a second one for the integration of multiple samples (which were previously processed by the single-sample pipeline). Currently, each pipeline consists of two subpipelines (referred to as stages), plus two stages common to both the single-sample and integration pipelines. Details, along with the graphical structure, are available in `vignette("pipeline_overview")`.
In this guide you will see how to quickly set up `{scdrake}` and run the first stage (`01_input_qc`) of its single-sample pipeline using the provided example data. Once you familiarize yourself with the `{scdrake}` basics, we will guide you through the second stage of the single-sample pipeline (`02_norm_clustering`), which demonstrates cell clustering and annotation. Later you can continue with the guide for the integration pipeline in `vignette("scdrake_integration")`.

Where possible, we will show how to follow the steps both from within R and through the command line interface (CLI, see `vignette("scdrake_cli")` for a short overview).
We strongly recommend using the `{scdrake}` Docker image, as it is well tested. Even if you are familiar with Docker or Singularity, please refer to `vignette("scdrake_docker")`, as it covers some critical parameters.

After installation, you basically need the three steps described below to run the `{scdrake}` pipeline.
`{scdrake}` is a project-based package, so the first step is to initialize a new project. A project simply means a directory in which the analysis of your data will take place. That also means all paths used in configs are relative to the project's root directory (e.g. `output/plots/figure1.pdf`). This way your project is transferable between different computers and data locations.

Now we initialize a new `{scdrake}` project directory. First we load the `{scdrake}` package, and then call the `init_project()` function:
```r
library(scdrake)
init_project("~/scdrake_projects/pbmc1k", download_example_data = TRUE)
```
```bash
mkdir ~/scdrake_projects/pbmc1k
cd ~/scdrake_projects/pbmc1k
scdrake --download-example-data init-project
```
We assume that you are running a detached container that has a shared directory mounted as `/home/rstudio/scdrake_projects` (as described in `vignette("scdrake_docker")`).
```bash
mkdir ~/scdrake_projects/pbmc1k
cd ~/scdrake_projects/pbmc1k
docker exec -it -u rstudio -w /home/rstudio/scdrake_projects/pbmc1k <CONTAINER ID or NAME> \
  scdrake --download-example-data init-project
```
```bash
mkdir -p ~/scdrake_singularity
cd ~/scdrake_singularity
mkdir -p home/${USER} scdrake_projects/pbmc1k
singularity exec -e --no-home \
  --bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
  --pwd "/home/${USER}/scdrake_projects/pbmc1k" \
  path/to/scdrake_image.sif \
  scdrake --download-example-data init-project
```
This will:

- Create a new project directory `pbmc1k` in `/home/<user>/scdrake_projects`.
- Copy files bundled with the `{scdrake}` package: default YAML configs, Rmd files, R scripts etc.
- Create an RStudio project (`.Rproj`) file and set it as the active project (can be disabled).

Note that if your R session is restarted, you have to call `library(scdrake)` again.

Important: whenever you run the `{scdrake}` pipeline, make sure your working directory is set to the project's root. Although you can specify the project directory in the CLI using the `-d` / `--dir` parameter, we recommend always being in the project's root directory before you issue `{scdrake}` commands (this also conforms to the project-based approach). You may notice that the `-d` parameter is never used in the examples.
Let's inspect files in the project directory:
```r
fs::dir_tree(all = TRUE)
```
If you have the `tree` tool installed (if it's missing, `sudo apt install tree` will fix it):

```bash
tree
```

Otherwise:

```bash
Rscript -e 'fs::dir_tree("~/scdrake_projects/pbmc1k")'
```
```bash
docker exec -it -u rstudio -w /home/rstudio/scdrake_projects/pbmc1k <CONTAINER ID or NAME> \
  Rscript -e 'fs::dir_tree()'
```
```bash
singularity exec -e --no-home \
  --bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
  --pwd "/home/${USER}/scdrake_projects/pbmc1k" \
  path/to/scdrake_image.sif \
  Rscript -e 'fs::dir_tree()'
```
The most important files and directories are:

- `config/`: configuration files in YAML format. Each pipeline (single-sample and integration) and stage has its own config file, plus there is a general `pipeline.yaml` file with runtime parameters. Default configs (`*.default.yaml`) are bundled with the package and are used to supply newly added parameters to the local configs (`*.yaml`) when needed. See `vignette("scdrake_config")` for more details.
- `Rmd/`: RMarkdown files used for reporting pipeline results. Each stage has its own report Rmd file, and you can edit it according to your needs.
- `_drake_integration.R`, `_drake_single_sample.R`: entry scripts for `{drake}` when the pipeline is executed through `run_single_sample_r()` or `run_integration_r()`, or through the CLI.
- `plan_custom.R`: here you can define your own `{drake}` pipeline (plan) to extend `{scdrake}` (more on this topic in `vignette("scdrake_extend")`).
- `example_data/`: example datasets. These are raw feature-barcode matrices output by `cellranger` and provided by 10x Genomics.
- `scdrake.Rproj`: RStudio project file. If you open this file in RStudio, your working directory will be automatically set to the project's root directory.

In this three-step guide we are going to run the first stage (`01_input_qc`) of the single-sample pipeline. This stage imports scRNA-seq data, computes per-cell quality control metrics (total number of UMI, number of detected genes, percentage of expressed mitochondrial genes) and performs filtering, which can be based either on the median absolute deviation (dataset-sensitive filtering, the default) or on custom thresholds (custom filtering). You can read more about this stage and its analysis steps, parameters and outputs in `vignette("stage_input_qc")`.
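To make the dataset-sensitive filtering more concrete, below is a minimal sketch of MAD-based outlier detection using `{scuttle}` directly. This only illustrates the principle - the exact calls, metrics and thresholds used inside `{scdrake}` may differ:

```r
# A minimal sketch of MAD-based (dataset-sensitive) QC filtering with {scuttle}.
# This is NOT the exact code used by {scdrake} - just the principle.
library(scuttle)

sce <- scuttle::mockSCE()  # stand-in dataset
qc <- scuttle::perCellQCMetrics(sce)

# A cell is an outlier if it deviates by more than 3 MADs from the median,
# on the log scale, towards low library size or low number of detected genes.
low_lib  <- scuttle::isOutlier(qc$sum,      nmads = 3, type = "lower", log = TRUE)
low_gene <- scuttle::isOutlier(qc$detected, nmads = 3, type = "lower", log = TRUE)

discard <- low_lib | low_gene
table(discard)
sce_filtered <- sce[, !discard]
```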
The configuration files for all stages are stored in the `config/` directory. At this point we only need to modify a single parameter in the config file for the `01_input_qc` stage. To do so, open `config/single_sample/01_input_qc.yaml` and set the value of `path` inside `INPUT_DATA` to `"example_data/pbmc1k"`, so that it points to the directory with the PBMC 1k example dataset, which was downloaded on project initialization:
```yaml
INPUT_DATA:
  type: "cellranger"
  path: "example_data/pbmc1k"
  delimiter: ","
  target_name: "target_name"
```
Important: all paths in configs must be relative to the project's root directory, or absolute (not recommended), unless stated otherwise.
Now we are ready to execute the single-sample pipeline. `{drake}` pipelines are composed of individual steps called targets. Each target is represented by an R expression that returns its value (an R object). When a target finishes, its value is saved into the cache directory, where it can be used by other targets or loaded by the user into the current R session.

`{drake}` allows executing the pipeline such that only specific targets are made. This is controlled by the `DRAKE_TARGETS` parameter in the `config/pipeline.yaml` file (see `vignette("config_pipeline")`). The default value of `DRAKE_TARGETS` is `report_input_qc`, which represents the final target of the `01_input_qc` stage - an HTML report.
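Because finished targets are stored in the cache, you can load them into your R session at any time using `{drake}`'s standard cache functions. A small sketch (run from the project's root after the pipeline has finished; `sce_raw` is one of this stage's targets):

```r
# Inspect and load finished targets from the drake cache
# (run from the project's root after the pipeline has finished).
library(drake)

drake::cached()                   # list targets present in the cache
drake::loadd(sce_raw)             # load a target into the global environment
sce_copy <- drake::readd(sce_raw) # or return it as a value
```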
All files from this stage will be saved under the `output/single_sample` directory; the report, in particular, goes to `output/single_sample/01_input_qc/01_input_qc.html`. The output directory is specified by the `BASE_OUT_DIR` parameter in `config/single_sample/00_main.yaml`. The `00_main.yaml` config stores parameters that are common to all stages of the single-sample or integration pipeline, e.g. titles or the organism (see `vignette("config_main")`).
Now let's run the pipeline:
```r
run_single_sample_r()
```
```bash
scdrake --pipeline-type single_sample run
```
```bash
docker exec -it -u rstudio -w /home/rstudio/scdrake_projects/pbmc1k <CONTAINER ID or NAME> \
  scdrake --pipeline-type single_sample run
```
```bash
singularity exec -e --no-home \
  --bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
  --pwd "/home/${USER}/scdrake_projects/pbmc1k" \
  path/to/scdrake_image.sif \
  scdrake --pipeline-type single_sample run
```
Now you can inspect the HTML report located in `output/single_sample/01_input_qc/01_input_qc.html`. It gives you a summary view of the quality of the dataset and shows how each filtering type affects the number of cells. The report is mostly self-explanatory, as it describes all analysis steps that took place along with their visual outputs.
In the three steps above you have seen the basic functionality of `{scdrake}`. We can briefly summarize it:

- Each stage (e.g. `01_input_qc`) has its own configuration file.
- There is a main config (`00_main.yaml`), which is used throughout the pipeline's stages.
- Runtime parameters, including the targets to be made, are set in `pipeline.yaml`.
Each stage or general config has its own vignette describing the analysis steps, parameters and outputs (targets - R objects, HTML reports):

- `vignette("config_pipeline")`: the pipeline config
- `vignette("config_main")`: the main config
- `01_input_qc`: reading in data, filtering, quality control -> `vignette("stage_input_qc")`
- `02_norm_clustering`: normalization, HVG selection, dimensionality reduction, clustering, cell type annotation -> `vignette("stage_norm_clustering")`
- `01_integration`: reading in data and integration -> `vignette("stage_integration")`
- `02_int_clustering`: post-integration clustering and cell annotation -> `vignette("stage_int_clustering")`
- `cluster_markers` -> `vignette("stage_cluster_markers")`
- `contrasts` (differential expression) -> `vignette("stage_contrasts")`

You can navigate these vignettes in the top bar in the Articles drop-down menu. For the integration pipeline, you might be interested in its guide in `vignette("scdrake_integration")`.

Note that the default config files are meant for the example PBMC data, so before you analyse your own data, you should read the section below about the important steps to take beforehand.
Let's practice a bit more with `{scdrake}` and modify a cell filtering parameter. This will also demonstrate `{drake}`'s ability to skip finished targets. Let's assume you have decided to be stricter in the dataset-sensitive cell filtering and want to use a MAD threshold of `2`. At the same time, you want to save the report into a different file so it can be compared with the previous, more benevolent filtering. Open `config/single_sample/01_input_qc.yaml` and change:
```yaml
MAD_THRESHOLD: 2
INPUT_QC_BASE_OUT_DIR: "01_input_qc_strict"
```
After repeating Step 3 above, you can see that the most time-consuming targets, `sce_raw` and `empty_droplets`, were skipped. Also, because we output the HTML report for the new parameter into a different directory (`output/single_sample/01_input_qc_strict/01_input_qc.html`), we can easily compare it with the previous report. If you open the new report, you can see that more cells were filtered out.
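If you are curious which targets `{drake}` will actually rebuild after a config change, you can ask it before running the pipeline. A sketch pointing `drake::r_outdated()` at `{scdrake}`'s single-sample entry script (run from the project's root; see `?drake::r_outdated` for details):

```r
# List targets that are out of date after the config change; up-to-date
# targets such as sce_raw and empty_droplets will be skipped on the next run.
library(drake)

drake::r_outdated(source = "_drake_single_sample.R")
```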
Now we will look at perhaps the most important outcomes of a scRNA-seq data analysis: cell clustering and cell type annotation. These procedures are implemented in the second stage, `02_norm_clustering`, of the single-sample pipeline (see `vignette("stage_norm_clustering")`). Clustering and annotation are preceded by other necessary steps, most importantly: normalization, highly variable gene selection, PCA calculation, and dimensionality reduction based on the principal components (UMAP, t-SNE). A similar stage is also available in the integration pipeline: `02_int_clustering` (see `vignette("stage_int_clustering")`).
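For orientation, the kind of processing this stage performs can be sketched with `{scran}`/`{scater}` directly. This is a generic, simplified illustration on mock data - `{scdrake}`'s actual targets use different parameters and offer many more options:

```r
# A simplified, generic sketch of the 02_norm_clustering processing steps
# using {scran}/{scater} on mock data; {scdrake} itself is more configurable.
library(scran)
library(scater)

sce <- scater::mockSCE()                        # stand-in dataset
sce <- scater::logNormCounts(sce)               # normalization
dec <- scran::modelGeneVar(sce)                 # model per-gene variance
hvgs <- scran::getTopHVGs(dec, n = 1000)        # HVG selection
sce <- scater::runPCA(sce, subset_row = hvgs)   # PCA on HVGs
sce <- scater::runUMAP(sce, dimred = "PCA")     # UMAP from the PCs

g <- scran::buildSNNGraph(sce, use.dimred = "PCA")
clusters <- igraph::cluster_louvain(g)$membership  # graph-based clustering
table(clusters)
```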
Initially, we won't make any changes to the `config/single_sample/02_norm_clustering.yaml` config and will keep its sensible defaults, among them:

- normalization by deconvolution (`scuttle::computePooledFactors()`)
- automatic cell type annotation using the Human Primary Cell Atlas (`celldex::HumanPrimaryCellAtlasData()`) and Monaco Immune Data (`celldex::MonacoImmuneData()`) as references

What we need in order to perform the `02_norm_clustering` stage is just to change the targets for `{drake}`: open `config/pipeline.yaml` and change `DRAKE_TARGETS` to `["report_norm_clustering", "report_norm_clustering_simple"]`. These targets will produce two reports for this stage. The first one includes technical details about the important analysis steps, while the second one is simplified and limited to dimensionality reduction plots.
Now run the pipeline using the familiar command from Step 3. Then you can go through the report located in the `output/single_sample/02_norm_clustering/02_norm_clustering.html` file. The report provides all the necessary information about the performed analysis steps and their graphical outputs, most importantly dimensionality reduction plots displaying cell-cluster membership. Also, at the end of the report you can find the results of the automatic cell type annotation. You can see that the clusters represent the main immune cell types. If you are interested in the markers of the predicted cell types, you can click on the Marker heatmaps PDF link.
`{scdrake}` offers a convenient way to visualize the expression of markers of interest in reduced dimensions. It simply uses a CSV file that defines groups of markers given by their symbols. This is an example of such a CSV file (`selected_markers.csv`) that is bundled with `{scdrake}` and automatically copied to the root of a new project on its initialization:
```
Naive_CD4+_T,IL7R:CCR7
Memory_CD4+,IL7R:S100A4
CD14+_Mono,CD14:LYZ
B,MS4A1
CD8+_T,CD8A
FCGR3A+_Mono,FCGR3A:MS4A7
NK,GNLY:NKG7
DC,FCER1A:CST3
Platelet,PPBP
```
The first column specifies a group of markers (cell types in this case) and the second column lists the marker symbols separated by colons. Note that there is no header.
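Just to illustrate the format (this is not `{scdrake}`'s actual parsing code), the file can be read in base R like this:

```r
# Read the header-less two-column CSV and split the marker symbols on ":".
markers <- read.csv(
  "selected_markers.csv",
  header = FALSE,
  col.names = c("group", "markers")
)
marker_list <- setNames(strsplit(markers$markers, ":", fixed = TRUE), markers$group)
marker_list[["Naive_CD4+_T"]]
#> [1] "IL7R" "CCR7"
```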
To enable the usage of this file and the subsequent generation of expression plots, simply change `SELECTED_MARKERS_FILE` to `"selected_markers.csv"` in `config/single_sample/02_norm_clustering.yaml`.

Now you have two options for telling `{scdrake}` to make the expression plots (PDFs) using the CSV file:

- Make the `report_norm_clustering` and `report_norm_clustering_simple` targets again, as they also provide links to the generated PDF files (under the Dimensionality reduction plots section).
- In `config/pipeline.yaml`, change `DRAKE_TARGETS` to `["selected_markers_plots_files_out"]`. This target outputs the PDF files to the `output/single_sample/selected_markers_plots` directory (to which the links in the report lead).

There are three PDFs with different reduced dimensions (PCA, t-SNE, UMAP), and each contains a matrix of expression plots of the genes supplied in the CSV file.
When the identities of clusters are known, it is possible to transform cluster numbers into more informative cell type names. For that, there is a handy parameter, `CELL_GROUPINGS`, in `config/single_sample/02_norm_clustering.yaml`. Let's say you want to manually annotate some clusters in the Leiden clusterings with resolutions `0.4` and `0.6`. You can change the parameter as follows (please do not search for any biological relevance here - the annotation is solely intended to demonstrate the usage of this parameter):
```yaml
CELL_GROUPINGS:
  - cluster_graph_leiden_r0.4_annotated:
      source_column: "cluster_graph_leiden_r0.4"
      description: "Graph-based clustering (Leiden alg., r = 0.4), annotated clusters"
      assignments:
        1: "memory CD4+"
        6: "B"
        7: "memory CD4+"
    cluster_graph_leiden_r0.6_annotated:
      source_column: "cluster_graph_leiden_r0.6"
      description: "Graph-based clustering (Leiden alg., r = 0.6), annotated clusters"
      assignments:
        2: "TNK"
        3: "CD4 naive"
```
There are two items in `CELL_GROUPINGS`, each referencing a Leiden clustering with a different resolution. The first item/grouping has the following structure:

- `cluster_graph_leiden_r0.4_annotated`: the name of the new cell grouping to be added.
- `source_column: "cluster_graph_leiden_r0.4"`: the name of the cell metadata column to use for the new assignments. This will usually be a result of clustering; you can find how clusterings are named in `vignette("stage_norm_clustering")` under the Outputs tab.
- `description: "Graph-based clustering ..."`: a description of the new assignment. If it were not set, the name of the grouping (`cluster_graph_leiden_r0.4_annotated`) would be used instead.
- `assignments`: assignments of old labels to new ones. That is, cluster `6` will be renamed to `"B"`, and clusters `1` and `7` will be merged into a new label, `"memory CD4+"`. All unspecified old labels will be kept as they are (see the sketch below).
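Conceptually, each grouping is just a recoding of an existing cell metadata column. A minimal base-R sketch of this semantics (the object names here are made up for illustration; `{scdrake}`'s internal implementation differs):

```r
# Hypothetical illustration of the CELL_GROUPINGS semantics: rename some
# cluster labels, merge others, and keep unspecified labels as they are.
clusters <- factor(c(1, 2, 6, 7, 3, 1))  # stands in for "cluster_graph_leiden_r0.4"

assignments <- c("1" = "memory CD4+", "6" = "B", "7" = "memory CD4+")

new_labels <- as.character(clusters)
hit <- new_labels %in% names(assignments)
new_labels[hit] <- assignments[new_labels[hit]]

# Labels "2" and "3" were not specified, so they are kept as they are.
table(new_labels)
```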
These new cell groupings can then be referenced in other parameters. For example, you can add them to the `NORM_CLUSTERING_REPORT_DIMRED_PLOTS_OTHER` list, which specifies the variables by which cells are colored in dimensionality reduction plots:
```yaml
NORM_CLUSTERING_REPORT_DIMRED_PLOTS_OTHER:
  - "phase": "Cell cycle phases"
    "doublet_score": "Doublet score"
    "total": "Total number of UMI"
    "detected": "Detected number of genes"
    "cluster_graph_leiden_r0.4_annotated": null
    "cluster_graph_leiden_r0.6_annotated": null
```
For the new groupings we are not specifying the plot titles - those are taken from the `description` items in `CELL_GROUPINGS`.
Again, there are two options for generating the plots:

- In `config/pipeline.yaml`, change `DRAKE_TARGETS` to `report_norm_clustering` or `report_norm_clustering_simple`, and rerun the pipeline. After that, you can see the new plots under the Dimensionality reduction plots section in the report.
- Use `["dimred_plots_other_vars_files_out"]` for `DRAKE_TARGETS`. The files will be saved in the `output/single_sample/dimred_plots` directory and named `cluster_graph_leiden_r0.4_annotated_<dimred_name>` etc. (and again, those files are referenced from the report).

It is also possible to assign custom cell metadata from a CSV file - see the `ADDITIONAL_CELL_DATA_FILE` parameter in `vignette("stage_norm_clustering")` or the "I want to manually annotate cells" section in `vignette("scdrake_faq")`.
Note that `CELL_GROUPINGS` and `ADDITIONAL_CELL_DATA_FILE` are also available in the `02_int_clustering` stage of the integration pipeline.
Besides visualizing marker genes of interest, one can also compute statistical tests to discover the markers that drive the separation of cells into clusters. We won't go into details here; see `vignette("stage_cluster_markers")`, which describes the `cluster_markers` stage, available for both the single-sample and integration pipelines. It is also possible to perform differential gene expression analysis between clusters (or, in general, groups of cells) through the `contrasts` stage (see `vignette("stage_contrasts")`). Note that in these stages you can reference the cell groupings defined in the `CELL_GROUPINGS` parameter.
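As a taste of what such marker tests look like, here is a generic sketch using `scran::findMarkers()` on mock data - the `cluster_markers` stage wraps similar tests with more options and full reporting, so treat this only as an illustration:

```r
# A generic sketch of per-cluster marker tests with {scran} on mock data;
# the cluster_markers stage offers more tests and HTML reporting.
library(scran)
library(scater)

sce <- scater::logNormCounts(scater::mockSCE())
clusters <- kmeans(t(SingleCellExperiment::logcounts(sce)), centers = 3)$cluster

markers <- scran::findMarkers(sce, groups = clusters, test.type = "t")
head(markers[[1]])  # top markers for the first cluster
```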
The default config files are designed to work with the provided example data. The parameters mostly default to the arguments of the underlying functions in packages called by `{scdrake}`, or are set according to common/best analysis practices (e.g. those described in OSCA). Anyway, it is worth saying that at the very least you have to specify the location of your input data (`INPUT_DATA`).

Below you can find the important parameters you should review before you run the pipeline on your own data for the first time. All parameters are documented in the vignettes for their respective stages or general configs.
**Pipeline config (`pipeline.yaml`)** -> `vignette("config_pipeline")`

- `DRAKE_TARGETS`: which targets you want to make

**Main config (`00_main.yaml`)** -> `vignette("config_main")`

- `ORGANISM`, `ANNOTATION_LIST`: an organism name used to match a proper annotation package defined in `ANNOTATION_LIST`
- `ENSEMBL_SPECIES`: an Ensembl species name used to build links to the Ensembl genes website
- `BASE_OUT_DIR`: a path to the base output directory to which each stage's files will be saved

**Input QC config (`01_input_qc.yaml`)** -> `vignette("stage_input_qc")`

- `INPUT_DATA`: specifies the type and location of the input data
- `EMPTY_DROPLETS_ENABLED`: enable/disable removal of empty droplets
- `SAVE_DATASET_SENSITIVE_FILTERING`: use dataset-sensitive or custom-threshold cell filtering

**Normalization and clustering config (`02_norm_clustering.yaml`)** -> `vignette("stage_norm_clustering")`

- `NORMALIZATION_TYPE`: normalization of UMI counts either by `{scuttle}` (`{scran}`) or `{sctransform}` (`{Seurat}`)
- `HVG_METRIC`, `HVG_SELECTION`, `HVG_SELECTION_VALUE`: how to select highly variable genes
- `PCA_SELECTION_METHOD`, `PCA_FORCED_PCS` (if `PCA_SELECTION_METHOD` is `"forced"`): how to select the number of first principal components
- `CLUSTER_GRAPH_LEIDEN_ENABLED`, `CLUSTER_GRAPH_LEIDEN_RESOLUTIONS`
- `CLUSTER_GRAPH_LOUVAIN_ENABLED`, `CLUSTER_GRAPH_LOUVAIN_RESOLUTIONS`
- `CLUSTER_GRAPH_WALKTRAP_ENABLED`
- `CLUSTER_KMEANS_K_ENABLED`, `CLUSTER_KMEANS_KBEST_ENABLED`, `CLUSTER_KMEANS_K`
- `CLUSTER_SC3_ENABLED`, `CLUSTER_SC3_K`, `CLUSTER_SC3_N_CORES`

**Integration config (`01_integration.yaml`)** -> `vignette("stage_integration")`

- `INTEGRATION_SOURCES`: similar to the `INPUT_DATA` parameter; specifies the types and paths of the datasets to be integrated

**Post-integration clustering config (`02_int_clustering.yaml`)** -> `vignette("stage_int_clustering")`

- Mostly the same parameters as in the `02_norm_clustering.yaml` config of the single-sample pipeline

**Cluster markers and contrasts configs (`cluster_markers.yaml`, `contrasts.yaml`)** -> `vignette("stage_cluster_markers")`, `vignette("stage_contrasts")`

- `CLUSTER_MARKERS_SOURCES`, `CONTRASTS_SOURCES`: for which cell groupings (e.g. clusters from a particular clustering algorithm) to calculate markers or differential gene expression

When a new version of `{scdrake}` is released, you can update your project with (assuming your working directory is a `{scdrake}` project's root):
```r
update_project()
```
```bash
scdrake update-project
```
```bash
docker exec -it -u rstudio -w /path/to/scdrake/project <CONTAINER ID or NAME> \
  scdrake update-project
```
```bash
singularity exec -e --no-home \
  --bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
  --pwd "/home/${USER}/scdrake_projects/project" \
  path/to/scdrake_image.sif \
  scdrake update-project
```
This will overwrite the following project files with the package-bundled ones:

- RMarkdown files in `Rmd/`
- the entry scripts `_drake_single_sample.R` and `_drake_integration.R`, used by `run_single_sample_r()` and `run_integration_r()` (wrappers around `drake::r_make()`)
- the default configs in `config/` (`*.default.yaml`)

By default, you will be asked if you want to continue, as you might lose your local modifications.
Now you should know the `{scdrake}` basics. For the integration pipeline, you might be interested in its guide in `vignette("scdrake_integration")`.

For full insight into `{scdrake}`, you can also read the following vignettes:

- Docker usage: `vignette("scdrake_docker")`
- Advanced topics: `vignette("scdrake_advanced")`
- Extending the pipeline: `vignette("scdrake_extend")`
- `{drake}` basics: `vignette("drake_basics")`
- The `{drake}` book: https://books.ropensci.org/drake/
- Pipeline overview: `vignette("pipeline_overview")`
- FAQ: `vignette("scdrake_faq")`
- CLI overview: `vignette("scdrake_cli")`
- Config files: `vignette("scdrake_config")`
- Environment variables: `vignette("scdrake_envvars")`