This stage imports multiple SingleCellExperiment
objects and performs integration using one or several methods.
r emoji::emoji("gear")
Config file: config/integration/01_integration.yaml
r emoji::emoji("clipboard")
HTML report target (in config/pipeline.yaml
):
DRAKE_TARGETS: ["report_integration"]
r emoji::emoji("scroll")
Example report
(used config)
r emoji::emoji("ladder")
Structure
drake
caches of multiple single-sample pipelinesSingleCellExperiment
objectsbatchelor::multiBatchNorm()
,
theory)batchelor::rescaleBatches()
,
theory):
scale counts so that the average count within each batch is the same for each genebatchelor::regressBatches()
,
theory):
fit a linear model to each gene regress out uninteresting factors of variation, returning a matrix of residualsbatchelor::fastMNN()
,
theory)The steps below are performed for each integration method, plus separately for HVGs with removed cell cycle-related genes (if this removal was requested).
02_norm_clustering
stage), but, in this case,
the selection of PCs can be configured for each integration method.Config for this stage is stored in the config/integration/01_integration.yaml
file.
Directory with this file is read from SCDRAKE_INTEGRATION_CONFIG_DIR
environment variable upon {scdrake}
load or
attach, and saved as scdrake_integration_config_dir
option. This option is used as the default argument value in
several {scdrake}
functions.
INTEGRATION_SOURCES: - pbmc1k: path: "path/to/.drake or path/to/sce.Rds" path_type: "drake_cache" description: "10x Genomics PBMC 1k dataset" hvg_rm_cc_genes: True hvg_cc_genes_var_expl_threshold: 5 pbmc3k: path: "path/to/.drake or path/to/sce.Rds" path_type: "drake_cache" description: "10x Genomics PBMC 3k dataset" hvg_rm_cc_genes: False hvg_cc_genes_var_expl_threshold: null
Type: list of named lists
This parameter specifies datasets to be integrated. Those are read in from single-sample pipeline results
(the sce_final_norm_clustering
targets) taken from corresponding {drake}
caches or from SCE objects saved in Rds format.
The latter should correspond to sce_final_norm_clustering
targets, but one can modify their e.g. colData()
as needed.
Each name of a nested list specifies the name of a single-sample, e.g. pbmc1k
. Parameters for each single-sample are:
path: "path/to/.drake or path/to/sce.Rds"
(character scalar): a path to {drake}
cache directory or SCE object Rds file
(depends on path_type
).path_type: "drake_cache"
(character scalar: "drake_cache"
| "sce"
): a type of input data - either a {drake}
cache directory or SCE object Rds file.description: "10x Genomics PBMC 1k dataset"
(character scalar): a description of the single-sample, will appear in reports.hvg_rm_cc_genes: True
(logical scalar): if True
, cell cycle-related genes will be removed prior to selection of
highly variable genes (HVGs). This is the same method of cell cycle correction as in the 02_norm_clustering
stage
of the single-sample pipeline (see vignette("stage_norm_clustering")
for more details).hvg_cc_genes_var_expl_threshold: 5
(numeric scalar): a threshold on cell cycle variance explained.
Genes with var. expl. greater than this threshold will be marked as CC-related.Currently, there are some limitations of which datasets can be integrated:
{scran}
method.HVG_METRIC
of "gene_var"
or "gene_cv2"
).This is done prior to integration, and is same for all methods specified in the INTEGRATION_METHODS
(see below).
If a single-sample has the hvg_rm_cc_genes
set to True
, cell cycle-related genes will be removed
prior to HVG combination and selection.
HVG_COMBINATION_INT: "hvg_metric"
Type: character scalar ("hvg_metric"
| "intersection"
| "union"
| "all"
)
How to combine HVGs from single-samples:
"hvg_metric"
: combine HVG metrics (gene variance or CV2) and then apply HVG selection similar to procedure in
the 02_norm_clustering
stage of the single-sample pipeline. Note that in order to combine HVG metrics,
all samples had to have HVG_METRIC
set to either "gene_var"
(scran::combineVar()
) or
"gene_cv2"
(scran::combineCV2()
)."intersection"
"union"
"all"
: use all common genes from all samples as HVGs (without duplicates).HVG_SELECTION_INT: "top" HVG_SELECTION_VALUE_INT: 3000
Type: character scalar, integer scalar
HVG selection strategy. Only relevant when HVG_COMBINATION_INT
is "hvg_metric"
,
i.e. this selection will be applied to combined HVG metrics.
These parameters are similar to HVG_SELECTION
and HVG_SELECTION_VALUE
in the 02_norm_clustering
stage
of the single-sample pipeline (see vignette("stage_norm_clustering")
for more details).
INTEGRATION_METHODS: - uncorrected: pca_selection_method: "forced" pca_forced_pcs: 15 tsne_perp: 20 tsne_max_iter: 1000 rescaling: pca_selection_method: "forced" pca_forced_pcs: 15 tsne_perp: 20 tsne_max_iter: 1000 integration_params: log.base: 2 pseudo.count: 1 regression: pca_selection_method: "corrected" pca_forced_pcs: 15 tsne_perp: 20 tsne_max_iter: 1000 integration_params: d: 50 mnn: pca_selection_method: "corrected" pca_forced_pcs: 15 tsne_perp: 20 tsne_max_iter: 1000 integration_params: k: 20 prop.k: null cos.norm: True ndist: 3 d: 50 merge.order: null auto.merge: True harmony: pca_selection_method: null pca_forced_pcs: null tsne_perp: 20 tsne_max_iter: 1000 integration_params: dims.use: 50 theta: null lambda: null sigma: 0.1 nclust: null tau: 0 block.size: 0.05 max.iter.harmony: 10 max.iter.cluster: 20 epsilon.cluster: 0.00001 epsilon.harmony: 0.0001
Type: list of named lists
Each named list is specifying parameters for one integration method. The ones specified in the integration_params
list
are used directly in an integration method, while the others are used prior/after integration.
The "uncorrected"
and at least one integration method must be set (by default, all methods are performed).
Here are descriptions and possible parameters (some of them are used by multiple methods, and thus described only once)
for each integration method:
uncorrected
: correction for inter-batch sequencing depth with batchelor::multiBatchNorm()
.
This method is required, because it is used for differential expression analysis (cluster markers and contrasts).pca_selection_method: "forced"
(character scalar: "forced"
| "elbow"
| "technical_noise"
):
see the PCA_SELECTION_METHOD
parameter in vignette("stage_norm_clustering")
.pca_forced_pcs: 15
(integer scalar): force this number of selected PCs (if pca_selection_method
is "forced"
).tsne_perp: 20
(numeric scalar): t-SNE perplexity.tsne_max_iter: 1000
(positive integer scalar): a maximum number of t-SNE iterations.
rescaling
: scale counts so that the average count within each batch is the same for each gene
(batchelor::rescaleBatches()
).
pca_selection_method: "forced"
pca_forced_pcs: 15
tsne_perp: 20
tsne_max_iter: 1000
integration_params
: directly passed to batchelor::RescaleParam()
used for batchelor::rescaleBatches()
.
Default values from this function are used by default.
log.base: 2
(numeric scalar): a base of the log-transformation.pseudo.count: 1
(numeric scalar): a pseudo-count used for the log-transformation.regression
: fit a linear model to each gene regress out uninteresting factors of variation,
returning a matrix of residuals (batchelor::regressBatches()
).
pca_selection_method: "corrected"
: in this integration method, batchelor::multiBatchPCA()
is used,
and all PCs (50 by default) from this method will be used if "corrected"
is specified.
Otherwise any other PCA selection method can be used.pca_forced_pcs: 15
tsne_perp: 20
tsne_max_iter: 1000
integration_params
: directly passed to batchelor::RegressParam()
used for batchelor::regressBatches()
.
Default values from this function are used by default.
d: 50
(integer scalar): a number of PCs to compute in batchelor::multiBatchPCA()
.mnn
: fast mutual nearest neighbors (batchelor::fastMNN()
).
pca_selection_method: "corrected"
pca_forced_pcs: 15
tsne_perp: 20
tsne_max_iter: 1000
integration_params
: directly passed to batchelor::FastMnnParam()
used for batchelor::fastMNN()
.
Default values from this function are used by default.
k: 20
(integer scalar): a number of nearest neighbors to consider when identifying MNNs.prop.k: null
(numeric scalar): a proportion of cells in each dataset to use for mutual nearest neighbor
searching. If set, the number of nearest neighbors used for the MNN search in each batch is redefined as
max(k, prop.k*N)
where N
is the number of cells in that batch.cos.norm: True
(logical scalar): if True
, cosine normalization is performed on the input data prior to PCA.ndist: 3
(numeric scalar): a threshold beyond which neighbors are to be ignored when computing correction
vectors. Each threshold is defined as a multiple of the number of median distances.d: 50
merge.order: null
(integer vector, list of lists, or null
): a merge order of single-samples.
Alternatively, a list of lists representing a tree structure specifying a hierarchical merge order
(not tested, see batchelor::fastMNN()
for more details).auto.merge: True
(logical scalar): if True
, automatically identify the "best" merge order.harmony
: projects cells into a shared embedding in which cells group by cell type rather than dataset-specific conditions
(harmony::RunHarmony()
, Nature Methods).
Default values from this function are used by default. Note that Harmony does not compute an integrated expression matrix,
but only reduced dimensions corrected for batch effect that are subsequently used for UMAP and t-SNE, and clustering.
pca_selection_method: null; pca_forced_pcs: null
: not used, see dims.use
belowtsne_perp: 20
tsne_max_iter: 1000
integration_params
: directly passed to harmony::RunHarmony()
.
Default values from this function are used by default, except dims.use
.dims.use: 50
(integer scalar): number of first PCs to use for Harmony integration.
Selection of PCs is not applied for Harmony - PCA with dims.use
PCs is computed prior to the integration.theta: null
: diversity clustering penalty parameter. Specify for each variable in group.by.vars
("batch"
).
Default theta = 2
, theta = 0
does not encourage any diversity. Larger values of theta result in more diverse clusters.lambda: null
: didge regression penalty parameter. Specify for each variable in group.by.vars
("batch"
).
Default lambda = 1
. Lambda must be strictly positive. Smaller values result in more aggressive correction.sigma: 0.1
: width of soft kmeans clusters. Default sigma = 0.1
. Sigma scales the distance from a cell to
cluster centroids. Larger values of sigma result in cells assigned to more clusters. Smaller values of sigma make
soft k-means cluster approach hard clustering.nclust: null
: number of clusters in model. nclust = 1
equivalent to simple linear regression.tau: 0
: protection against overclustering small datasets with large ones. tau
is the expected number of cells per cluster.block.size: 0.05
: what proportion of cells to update during clustering. Between 0
to 1
, default 0.05
.
Larger values may be faster but less accurate.max.iter.harmony: 10
: maximum number of rounds to run Harmony. One round of Harmony involves one clustering and one correction step.max.iter.cluster: 20
: maximum number of rounds to run clustering at each round of Harmony.epsilon.cluster: 0.00001
: convergence tolerance for clustering round of Harmony. Set to -inf
to never stop early.epsilon.harmony: 0.0001
: convergence tolerance for Harmony. Set to -inf
to never stop early.Used to obtain basic
integration diagnostics.
See scran::buildSNNGraph()
and bluster::makeSNNGraph()
for more details.
INTEGRATION_SNN_K: 10
Type: integer scalar
A number of nearest neighbors to consider during graph construction.
INTEGRATION_SNN_TYPE: "rank"
Type: character scalar ("rank"
| "number"
| "jaccard"
)
A type of weighting scheme to use for shared neighbors.
INTEGRATION_SNN_CLUSTERING_METHOD: "walktrap"
Type: character scalar ("walktrap"
| "louvain"
)
A type of MNN clustering algorithm. See igraph::cluster_walktrap()
and igraph::cluster_louvain()
for more details.
SELECTED_MARKERS_FILE: null
Type: character scalar or null
Similar to the SELECTED_MARKERS_FILE
parameter in vignette("stage_norm_clustering")
.
INTEGRATION_REPORT_RMD_FILE: "Rmd/integration/01_integration.Rmd"
Type: character scalar
A path to RMarkdown file used for HTML report of this pipeline stage.
INTEGRATION_BASE_OUT_DIR: "01_integration"
Type: character scalar
A path to base output directory for this stage. It will be created under BASE_OUT_DIR
specified in 00_main.yaml
config.
INTEGRATION_SELECTED_MARKERS_OUT_DIR: "selected_markers" INTEGRATION_REPORT_HTML_FILE: "01_integration.html"
Type: character scalar
Names of files and directories created under INTEGRATION_BASE_OUT_DIR
. Subdirectories are not allowed.
INTEGRATION_KNITR_MESSAGE: False INTEGRATION_KNITR_WARNING: False INTEGRATION_KNITR_ECHO: False
Type: logical scalar
These are passed to knitr::opts_chunk()
and used for rendering of stage's HTML report.
Here you can find description of the most important targets for this stage.
However, for a full overview, you have to inspect the
source code of the
get_integration_subplan()
function.
HTML report target name: report_integration
SingleCellExperiment
objectssce_int_import
: list of imported single-sample SCE objects, as defined in the INTEGRATION_SOURCES
parameter.
Because {drake}
cannot watch for changes in a cache, this target is always rerun.
Also, some constraints are checked (common normalization method and HVG metrics).
sce_int_raw_snn_clustering
: sce_int_import
with computed fast SNN clustering for each single-sample,
used for integration diagnostics
sce_int_processed
: sce_int_import
subsetted to common colData, and features (rows) and their corresponding
metadata (hvg_ids
, hvg_metric_fit
)
sce_int_multibatchnorm
: sce_int_processed
with SCE objects normalized for inter-batch sequencing depth
(batchelor::multiBatchNorm()
)
sce_int_df
: a tibble with integrated SCE objects. Each row is one method defined in the INTEGRATION_METHODS
parameter and either with or without removed cell cycle-related genes from HVGs.
See ?sce_int_df_fn
for information about what is added or modified in metadata()
of each integrated SCE object.
The integrated assay is named "integrated"
, except for Harmony integration, which only computes a batch-corrected
reduced dimensions (available in reducedDims(sce, "harmony")
).
sce_int_pca_df
: sce_int_df
with computed PCA and selected number of PCs. Also includes PC selection statistics
and plot
sce_int_clustering_df
: sce_int_pca_df
with computed MNN clustering used for integration diagnostics
sce_int_dimred_df
: sce_int_pca_df
with computed t-SNE and UMAP dimensionality reductions
hvg_int
, hvg_int_with_cc
: lists of HVGs and selection parameters. The latter is not NULL
when any of the
single-samples defined in the INTEGRATION_SOURCES
parameter has hvg_rm_cc_genes
set to True
.
hvg_plots_int_df
: a tibble with HVG diagnostic plots
sce_int_dimred_plots_df
: a tibble with dimensionality reduction plots
selected_markers_int_plots_df
: a tibble with plots of selected markers, similar to selected_markers_plots
in the 02_norm_clustering
stage of the single-sample pipeline
selected_markers_plots_files_out
: make this target to export the plots
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.