knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "figures/", dpi = 36 )
Here, we demonstrate a grid search of clustering parameters with a mouse
hippocampus VeraFISH dataset. BANKSY currently provides four algorithms for
clustering the BANKSY matrix with clusterBanksy: Leiden (default), Louvain,
k-means, and model-based clustering. In this vignette, we run only Leiden
clustering. See ?clusterBanksy
for more details on the parameters for
different clustering methods.
start.time <- Sys.time()
The dataset comprises gene expression for 10,944 cells and 120 genes in 2
spatial dimensions. See ?Banksy::hippocampus
for more details.
# Load libs library(Banksy) library(SummarizedExperiment) library(SpatialExperiment) library(scuttle) library(scater) library(cowplot) library(ggplot2) # Load data data(hippocampus) gcm <- hippocampus$expression locs <- as.matrix(hippocampus$locations)
Here, gcm
is a gene by cell matrix, and locs
is a matrix specifying the
coordinates of the centroid for each cell.
head(gcm[,1:5]) head(locs)
Initialize a SpatialExperiment object and perform basic quality control. We keep cells with total transcript count within the 5th and 98th percentile:
se <- SpatialExperiment(assay = list(counts = gcm), spatialCoords = locs) colData(se) <- cbind(colData(se), spatialCoords(se)) # QC based on total counts qcstats <- perCellQCMetrics(se) thres <- quantile(qcstats$total, c(0.05, 0.98)) keep <- (qcstats$total > thres[1]) & (qcstats$total < thres[2]) se <- se[, keep]
Next, perform normalization of the data.
# Normalization to mean library size se <- computeLibraryFactors(se) aname <- "normcounts" assay(se, aname) <- normalizeCounts(se, log = FALSE)
BANKSY has a few key parameters. We describe these below.
For characterising neighborhoods, BANKSY computes the weighted neighborhood
mean (H_0
) and the azimuthal Gabor filter (H_1
), which estimates gene
expression gradients. Setting compute_agf=TRUE
computes both H_0
and H_1
.
k_geom
specifies the number of neighbors used to compute each H_m
for
m=0,1
. If a single value is specified, the same k_geom
will be used
for each feature matrix. Alternatively, multiple values of k_geom
can be
provided for each feature matrix. Here, we use k_geom[1]=15
and
k_geom[2]=30
for H_0
and H_1
respectively. More neighbors are used to
compute gradients.
For datasets generated using Visium v1/v2, use
k_geom=18
(ork_geom <- c(18, 18)
ifcompute_agf = TRUE
), since that corresponds to taking as neighbourhood two concentric rings of spots around each spot.
We compute the neighborhood feature matrices using normalized expression
(normcounts
in the se
object).
k_geom <- c(15, 30) se <- computeBanksy(se, assay_name = aname, compute_agf = TRUE, k_geom = k_geom)
computeBanksy
populates the assays
slot with H_0
and H_1
in this
instance:
se
The lambda
parameter is a mixing parameter in [0,1]
which
determines how much spatial information is incorporated for downstream analysis.
With smaller values of lambda
, BANKY operates in cell-typing mode, while at
higher levels of lambda
, BANKSY operates in domain-finding mode. As a
starting point, we recommend lambda=0.2
for cell-typing and lambda=0.8
for
zone-finding, except for datasets generated using the Visium v1/v2 technology, for which
we recommend
lambda=0.2
for domain finding. See the note in the tutorial on the
main page for more info.
Here, we run lambda=0
which corresponds to non-spatial
clustering, and lambda=0.2
for spatially-informed cell-typing. We compute PCs
with and without the AGF (H_1
).
lambda <- c(0, 0.2) se <- runBanksyPCA(se, use_agf = c(FALSE, TRUE), lambda = lambda, seed = 1000)
runBanksyPCA
populates the reducedDims
slot, with each combination of
use_agf
and lambda
provided.
reducedDimNames(se)
Next, we cluster the BANKSY embedding with Leiden graph-based clustering. This
admits two parameters: k_neighbors
and resolution
. k_neighbors
determines
the number of k nearest neighbors used to construct the shared nearest
neighbors graph. Leiden clustering is then performed on the resultant graph
with resolution resolution
. For reproducibiltiy we set a seed for each
parameter combination.
k <- 50 res <- 1 se <- clusterBanksy(se, use_agf = c(FALSE, TRUE), lambda = lambda, k_neighbors = k, resolution = res, seed = 1000)
clusterBanksy
populates colData(se)
with cluster labels:
colnames(colData(se))
To compare clustering runs visually, different runs can be relabeled to
minimise their differences with connectClusters
:
se <- connectClusters(se)
Visualise spatial coordinates with cluster labels.
cnames <- colnames(colData(se)) cnames <- cnames[grep("^clust", cnames)] cplots <- lapply(cnames, function(cnm) { plotColData(se, x = "sdimx", y = "sdimy", point_size = 0.1, colour_by = cnm) + coord_equal() + labs(title = cnm) + theme(legend.title = element_blank()) + guides(colour = guide_legend(override.aes = list(size = 2))) }) plot_grid(plotlist = cplots, ncol = 2)
Compare all cluster outputs with compareClusters
. This function computes
pairwise cluster comparison metrics between the clusters in colData(se)
based
on adjusted Rand index (ARI):
compareClusters(se, func = "ARI")
or normalized mutual information (NMI):
compareClusters(se, func = "NMI")
See ?compareClusters
for the full list of comparison measures.
Vignette runtime:
Sys.time() - start.time
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.