Currently the results of subsampling (or of consensus between clusterings) are stored as an $N \times N$ matrix, with entries $s_{ij}$ being the proportion of times cells $i$ and $j$ are in the same cluster, out of the $B$ clusterings in which they were both assigned.
Note: for subsampling, they may not both be assigned because they were not both subsampled; for consensus, because one of them was a -1.
Then `hier01` clusters this $N \times N$ matrix by taking entries $d_{ij}=1-s_{ij}$ and performing hierarchical clustering on the $N \times N$ distance matrix $D$. Afterward, we determine clusters from the hierarchical clustering by starting at the top of the tree and going down until the resulting clustering satisfies the criterion that the cells in each cluster are sufficiently similar to each other, with a parameter $\alpha$ controlling the required similarity. There are two choices:
1) $\forall i,j \in \mathcal{C}, s_{ij} \geq 1-\alpha$, i.e. the maximal allowable $d_{ij}$ for any $i,j$ in the cluster is $\alpha$.
2) $\forall i \in \mathcal{C},\; \mathrm{mean}_{j \in \mathcal{C}}(s_{ij}) \geq 1-\alpha$ [default] (a sketch of this current procedure is given below).
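A minimal sketch of the current approach under the default (mean) criterion, assuming the $N \times N$ co-clustering matrix `S` is already in hand; the function name is made up, and the global `cutree` loop is a simplification of the actual per-branch descent in `hier01`:

```r
## Sketch only: cluster D = 1 - S hierarchically and move down the tree
## (approximated here by cutting at increasing k) until every cluster
## satisfies mean_j(s_ij) >= 1 - alpha for each of its members.
hierSketch <- function(S, alpha, method = "average") {
  hc <- hclust(as.dist(1 - S), method = method)   # linkage method is a guess
  for (k in seq_len(nrow(S))) {
    cl <- cutree(hc, k = k)
    ok <- all(vapply(split(seq_len(nrow(S)), cl), function(idx) {
      all(rowMeans(S[idx, idx, drop = FALSE]) >= 1 - alpha)
    }, logical(1)))
    if (ok) return(cl)
  }
  cl
}
```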
Instead of calculating and storing the $N \times N$ matrix, we would like to store the $N \times B$ matrix of cluster assignments.
Davide proposed using KNN functionality to determine, for each cell $i$, the neighbors $j$ such that $d_{ij}<\alpha$. Then we would need to figure out how to get a cluster assignment out of that.
1) We use `findNeighbors` from the `BiocNeighbors` package with `threshold=alpha`.
To Do: This function only works with Manhattan distance and Euclidean distance. We would need to see if the algorithm is even valid for our arbitrary distance (basically a Hamming distance) -- Aaron would probably know. Then we would need to get Aaron to allow us to do a pull request, and write C code for our distance function to be added to `BiocNeighbors`. Luckily it has already been updated to allow for two distances, so adding a third would hopefully not be a problem.
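As a rough illustration of the intended call on a toy $N \times B$ assignment matrix; note that, as things stand, `findNeighbors` would use Euclidean distance rather than our Hamming-like one, so the threshold here is arbitrary:

```r
## Illustrative only: findNeighbors() currently supports Euclidean/Manhattan
## distances, not our NA-aware Hamming distance.
library(BiocNeighbors)
set.seed(1)
B <- 20
x <- matrix(sample(1:6, 100 * B, replace = TRUE), nrow = 100)  # toy N x B assignments
nn <- findNeighbors(x, threshold = 2)  # arbitrary threshold; a real one would derive from alpha
head(nn$index, 3)      # neighbor indices of the first three cells
head(nn$distance, 3)   # corresponding distances (all below the threshold)
```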
2) Convert these results into an object for `igraph`.
To Do: It is not clear that there is an existing function to do this. It might be a relevant add-on to `BiocNeighbors`. More generally, if we create a wrapper package for the clustering methods we were talking about, I think an interface between `igraph` and `BiocNeighbors` may be needed, but I haven't looked into this.
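I don't know of an existing converter, but a hand-rolled one is short. A hypothetical sketch, assuming the `findNeighbors()` output format (a list with per-point neighbor index vectors in `$index`):

```r
## Hypothetical helper: build an undirected alpha-graph from a findNeighbors() result.
library(igraph)
neighborsToGraph <- function(nn, n) {
  edges <- do.call(rbind, lapply(seq_len(n), function(i) {
    js <- setdiff(nn$index[[i]], i)          # drop the self-match, if reported
    if (length(js) > 0) cbind(i, js) else NULL
  }))
  g <- make_empty_graph(n, directed = FALSE)
  if (!is.null(edges)) g <- add_edges(g, as.vector(t(edges)))
  simplify(g)                                 # collapse duplicate i-j / j-i edges
}
## e.g. g <- neighborsToGraph(nn, nrow(x))
```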
3) Determine the clusters. There are several logical choices, only some of which would not require any further parameters to be passed.
a) Take all maximal connected components with the function `components` in `igraph`. This is a far looser requirement than any we have now.
b) Take all maximal cliques (i.e. fully connected sets of vertices) with the function `max_cliques` (the R interface to `igraph_maximal_cliques`). This would basically correspond to the maximal option we currently have.
c) From each maximal connected component, iteratively prune off vertices, based on the criterion that the remaining cluster must have all vertices with degree $\geq |V|/2$, where $|V|$ is the size of the cluster -- i.e. require them to be connected to at least half the other vertices in the cluster. This is similar to the mean criterion above, if we had considered the median instead of the mean. (It would have to be iterative, since removing a node changes the degree of all remaining nodes.) We could use some variant of this without iteration, where we look at the degree distribution per connected component and pick some degree cutoff; perhaps by looking at the distribution it is possible to find a degree value that mathematically guarantees that the resulting vertices have degree $\geq |V|/2$ without iteration (e.g. if after removing 25% of the vertices the remaining ones all had degree $>0.75|V|$ in the full graph, then clearly in the pruned graph they would all have degree $>0.5|V|$).
However, note that in `hier01` our iterations go down the tree and split the existing cluster, rather than pruning away cell by cell. Without that difference, this pruning could easily devolve into b) with much worse speed and perhaps even worse results (after all, if a connected component consists of two maximal cliques, the maximal-cliques approach would find both, while the iterative one would only get one of them). So it seems important to have the ability to "capture" those that are discarded, and it is not clear how to do that.
d) Apply some community detection algorithm to the $\alpha$-graph, which is equivalent to applying it to each maximal connected component (for any algorithm worth anything). This would likely require some parameter for the clustering. Maybe the question is how bad this parameter is relative to our current (arbitrary) options; the advantage of our current method is that it is at least based on a guarantee about the resulting clustering in terms of $\alpha$ in some way, which a community detection algorithm wouldn't necessarily give.
It is unclear how different a) and b) will be from each other in practice -- i.e. how much extra is added by including any cell that is within $\alpha$ of at least one cell in the cluster. Clearly if little is added, none of these differences matter. But since we don't know, it seems like we need some kind of method, like our current one, that is somewhere in between the two. (A toy sketch of options a) through d) is given below.)
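For concreteness, a toy sketch of what a) through d) could look like in `igraph`; the graph is made up, and the pruning rule in c) is just one reading of the criterion:

```r
## Toy alpha-graph: two overlapping triangles, a pendant vertex, and a separate pair.
library(igraph)
g <- graph_from_literal(A-B, A-C, B-C, C-D, C-E, D-E, E-F, G-H)

## a) maximal connected components
components(g)$membership

## b) maximal cliques (fully connected vertex sets)
max_cliques(g, min = 2)

## c) iterative pruning within each component: repeatedly drop the lowest-degree
##    vertex until every remaining vertex has degree >= |V|/2 in the pruned graph
pruneComponent <- function(h) {
  repeat {
    if (vcount(h) <= 1 || all(degree(h) >= vcount(h) / 2)) return(h)
    h <- delete_vertices(h, which.min(degree(h)))
  }
}
lapply(decompose(g), pruneComponent)

## d) a community detection algorithm on the same graph
cluster_louvain(g)
```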
`findNeighbors` is an upper-level function that finds all neighbors within a given distance of each point. It takes a parameter to define the algorithm used to do so (an argument of class `BiocNeighborParam`). `findKNN` is the corresponding function that finds the k-nearest neighbors of each point.
`findNeighbors` has two possible options for the algorithm, which is basically a choice of how to most efficiently organize the data for the search: "Kmknn" or "VPTree". `findNeighbors` essentially dispatches to `rangeFindKmknn` or `rangeFindVptree`. Note that these functions allow the `threshold` argument to be a vector, specifying a different threshold for each point, though it is not clear this is useful for us. `findNeighbors` returns not just the indices of the points, but also their distances (if they are less than the threshold). (`findKNN` also has approximate methods, `Annoy` and `Hnsw`, but these do not appear to be options for finding all neighbors within a particular distance; it would be important, when comparing to methods like Seurat that build a kNN graph, to consider whether the speed increases they show for large datasets are based on using such approximations.)
(I would note, however, that the help page for `BiocNeighborParam` only lists "2 concrete subclasses", `KmknnParam` and `AnnoyParam`. It's not clear why the help doesn't list the other two options as subclasses.)
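For reference, a small example of the calls described above (standard `BiocNeighbors` usage on toy Euclidean data, not our distance):

```r
library(BiocNeighbors)
set.seed(1)
y <- matrix(rnorm(500 * 10), nrow = 500)

## exact range searches with the two supported backends
res.km <- findNeighbors(y, threshold = 1, BNPARAM = KmknnParam())
res.vp <- findNeighbors(y, threshold = 1, BNPARAM = VptreeParam())

## the threshold can also be a vector, one value per point
res.vec <- findNeighbors(y, threshold = runif(nrow(y), 0.5, 1.5))

## k-nearest neighbors, exact and with the approximate Annoy backend
knn.exact <- findKNN(y, k = 10)
knn.annoy <- findKNN(y, k = 10, BNPARAM = AnnoyParam())
```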
`BiocNeighbors` has two methods for preprocessing the data for fast (exact) computation (from the code):

1) `buildKmknn` (using `stats::kmeans`, all R code); then `rangeFindKmknn` is run, which is a call to C code: `.Call(cxx_range_find_kmknn, ...)`.

2) `buildVptree`, which is a C call, `.Call(cxx_build_vptree, ...)`; then `rangeFindVptree` (`.Call(cxx_range_find_vptree, ...)`).

We need to be sure that our proposed distance satisfies the triangle inequality, i.e. defines a proper metric. Otherwise, neither of these algorithms may work... Hamming distance is a proper metric, but we are using a variant of it to deal with NA values (i.e. we do not use the NA positions, but then normalize the distance to account for the number of positions shared).
How important is this? For subsampling, we can have samples that were not subsampled together, and thus have NA for their co-clustering. Our default implementation doesn't actually do this -- instead it assigns every sample, not only the subsampled ones, based on the clusters defined by the subsampled samples. But if the user picked the other option, then it would be a problem.
For consensus clustering, we get -1 values, and we do not count those against the sample. Namely, we ignore the -1 values, get the proportion of times together, and then filter out samples with too many -1 values.
If dealing with this proportion, rather than the straightforward Hamming distance, becomes a problem, then we could maybe reconsider these options.
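To pin down the distance under discussion, here is a sketch of my reading of it, treating both NA and -1 as missing and normalizing by the number of positions where both cells have a value:

```r
## Sketch of the NA-aware Hamming-like distance between two assignment vectors.
## Positions where either cell is missing (NA, or -1 for consensus) are dropped,
## and the mismatch count is normalized by the number of shared positions.
hammingNA <- function(a, b) {
  shared <- !(is.na(a) | is.na(b) | a == -1 | b == -1)
  if (!any(shared)) return(NA_real_)        # no clusterings in common
  mean(a[shared] != b[shared])
}
## e.g. hammingNA(c(1, 2, NA, 3), c(1, -1, 2, 2))   # 1 mismatch over 2 shared -> 0.5
```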
Algorithms

Clearly kmknn is not a good option for our distance, since running kmeans on our $N \times B$ matrix is not going to be terribly informative.