Single-cell transcriptome sequencing data are subject to substantial technical variation and batch effects that can confound the classification of cellular sub-types. Unfortunately, current clustering algorithms don't account for this uncertainty. To address this shortcoming, we have developed a noise perturbation algorithm called BEARscc that is designed to determine the extent to which classifications by existing clustering algorithms are robust to observed technical variation.
BEARscc makes use of ERCC spike-in measurements to model technical variance as a function of gene expression and technical dropout effects on lowly expressed genes. In our benchmarks, we found that BEARscc accurately models read count fluctuations and drop-out effects across transcripts with diverse expression levels. Applying our approach to publicly available single-cell transcriptome data of mouse brain and intestine, we have demonstrated that BEARscc identified cells that cluster consistently, irrespective of technical variation. For more details, see the manuscript that is now available on bioRxiv.
Installing BEARscc is easy. You can download a source package here. You can then use install.packages
, but give it the location of the downloaded file:
install.packages('builds/BEARscc_0.99.8.tar.gz', repos = NULL, type="source")
Here we provide a limited illustrative example of BEARscc on example data. Load the previously package example containing data.frame
objects with spike-in counts, spike-in concentrations, and endogenous counts. A
comprehensive vignette describes how these objects are combined to form a SingleCellExperiment
object. For
now we simply illustrate how to use this object with BEARscc to create simulated technical replicates and
evaluate clusters for noise. For a more comprehensive tutorial and vignette please examine our BioConductor
submission.
data("BEARscc_examples")
Estimate noise inputting ERCC known concentrations, and both endogenous and spike-in counts matrices into the estimate_noiseparameters()
function.
BEAR_examples.sce <- estimate_noiseparameters(BEAR_examples.sce, file="noise_estimation",
model_view=c("Observed","Optimized"))
Several options exist:
alpha_resolution
Because the alpha parameter is enumerated discretely and empirically evaluated for each value for each spike-in, it is necessary to specify the resolution (how small the step is between each explicit alpha test); this parameter defines the resolution of alpha values tested for maximum empirical fit to spike-ins. It is recommended that users utilize the default resolution.write.noise.model=TRUE
outputs two tab-delimited files containing the dropout effects and noise model parameters; this allows users to apply the noise generation on a seperate high compute node. plot==TRUE
will plot all linear fits and individual ERCCs distributions across samples, where model_view=c("Observed", "Optimized", "Poisson", "Neg. Binomial"
determines the statistical distributions that should be plotted for the ERCC plots.file="noise_estimation"
determines the root name for all plots, which write to the current working directory unless a path is contained in the root name. Following estimation of noise, the parameters are used to generate a noise-injected counts matrix.
BEAR_examples.sce <- simulate_replicates(BEAR_examples.sce, n=10)
List objects generated by the function estimate_noiseparameters()
are now annoated in the SingleCellExperiment
object BEAR_examples.sce and the variable n
is the desired number of clusters. The resulting object is another SingleCellExperiment
object, where the metadata contains a long list (simulated_replicates
) where each element is a noise-injected counts matrix, and one element is the original counts matrix.
For larger datasets it is recommended that the user set write.noise.model=TRUE
and copy the written bayesian drop-out and noise estimate files with the observed counts table to a high powered computing environment.
The script HPC_generate_noise_matrices
contains simulate_replicate()
functions that are adapted to a parallel environment along with suggested code, which is commented out for user-modification. The script generates seperate noise-injected counts files, which can be loaded into R as a list or re-clustered seperately in a high powered compute environment.
After generating simulated replicates, these should be re-clustered using the clustering method applied to the original dataset. For simplicity, here we use hierarchical clustering on a euclidean distance metric to identify two clusters. In our experience, some published clustering algorithms are sensitive to cell order, so we suggest scrambling the order of cells for each noise iteration as we do below in the function, recluster()
.
To quickly recluster a list, we define a reclustering function:
recluster <- function(x) {
x <- data.frame(x)
scramble <- sample(colnames(x), size=length(colnames(x)), replace=FALSE)
x <- x[,scramble]
clust <- hclust(dist(t(x), method="euclidean"),method="complete")
clust <- cutree(clust,2)
data.frame(clust)
}
First we apply the clustering function to our original observed expression data:
OG_clusters<-recluster(data.frame(assay(BEAR_examples.sce,
"observed_expression")))
colnames(OG_clusters)<-"Original"
We then apply the function recluster()
to all noise-injected counts matrices and the original counts matrix and manipulate the list into a data.frame
.
cluster.list<-lapply(metadata(BEAR_examples.sce)$simulated_replicates,
`recluster`)
clusters.df<-do.call("cbind", cluster.list)
colnames(clusters.df)<-names(cluster.list)
If running clustering algorithms on a seperate high power cluster, the user should retrieve labels and format as a data.frame
of cluster labels, where the last column must be the original cluster labels derived from the observed count data. As an example, examine the file, inst/example/example_clusters.tsv.
Using the cluster labels file as described above, we can generate a noise consensus matrix using:
noise_consensus <- compute_consensus(clusters.df)
Using the aheatmap()
function in the NMF
library, the consensus matrix result of 30 iterations of BEARscc on the provided example data will look this:
To reproduce the plot run:
library("NMF")
aheatmap(noise_consensus, breaks=0.5)
In order to interpret the noise consensus, we have defined three cluster (and analagous cell) metrics. Stabilty indicates the propensity for a putative cluster to contain the same cells across noise-injected counts matrices. Promiscuity indicates a tendency for cells in a putative cluster to associate with other clusters across noise-injected counts matrices. Score represents the promiscuity substracted from the stability.
We have found it useful to identify the optimal number of clusters in terms of resiliance to noise by examining these metrics by cutting hierarchical clustering dendograms of the noise consensus and comparing the results to the original clustering labels. To do this create a vector containing each number of clusters one wishes to examine (the function automatically determines the results for the dataset as a single cluster) and then cluster the consensus with cluster_consensus()
:
vector <- seq(from=2, to=5, by=1)
BEARscc_clusts.df <- cluster_consensus(noise_consensus,vector)
We add the original clustering to the data.frame
:
BEARscc_clusts.df <- cbind(BEARscc_clusts.df, Original=clusters.df$Original_counts)
Compute cluster metrics by running the command:
cluster_scores.dt <- report_cluster_metrics(BEARscc_clusts.df, noise_consensus, plot=TRUE, file="example")
The output is a melted data.frame
that displays the name of each cluster, the size of each cluster, the metric (Score, Promiscuity, Stability), the value of each metric for the respective cluster and clustering, the clustering in question (1,2,...,Original), whether the cluster consists of only one cell, and finally the mean of each metric across all clusters in a clustering.
An example of the resulting plot for 3 noise-injected perturbations is provided for the user's reference: inst/example/example_cluster_scores.pdf. It is evident from the plot that one cluster is optimal and outperforms the original clustering which bifurcated this set of purely technical data into 2 clusters.
Likewise, the cell metrics may be computed using:
cell_scores.df <- report_cell_metrics(BEARscc_clusts.df, noise_consensus)
The output is a melted data.frame
that displays the name of each cluster to which the cell belongs, the cell label, the size of each cluster, the metric (score, Promiscuity, Stability), the value for each metric, and finally the clustering in question (1,2,...,Original).
These results can be plotted to visualize cells in the context of different clusterings using ggplot2
.
This software is made available under the terms of the GNU General Public License v3
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.