ExoLabel: R Documentation
Runs Fast Label Propagation using disk space for constant memory complexity.
ExoLabel(edgelistfiles,
outfile=tempfile(),
mode=c("undirected", "directed"),
add_self_loops=FALSE,
ignore_weights=FALSE,
iterations=0L,
return_table=FALSE,
consensus_cluster=FALSE,
use_fast_sort=FALSE,
verbose=interactive(),
sep='\t',
tempfiledir=tempdir())
edgelistfiles
Character vector of files to be processed. Each entry should be a machine-interpretable path to an edgelist file. See Details for the expected format.
outfile
File to write final clusters to. Optional, defaults to a temporary file. Can be set to a vector of filepaths to run multiple clusterings (see Details).
mode
String specifying whether edges should be interpreted as undirected (default) or directed. Can be "undirected", "directed", or an unambiguous abbreviation.
add_self_loops
Should self loops be added to the network? If TRUE, self loops are added with weight 1.0; a numeric value adds self loops with that weight instead. Defaults to FALSE.
ignore_weights
Should weights be ignored? If TRUE, all edges are treated as if they had weight 1. Must be TRUE if the input files contain no weight column. Defaults to FALSE.
iterations
Maximum number of times to process each node. If set to zero or less, defaults to the square root of the largest node indegree (see Details).
return_table
Should the result of clustering be returned as a file or as a data.frame object? If FALSE (the default), returns the path to the file the clusters were written to; if TRUE, the clusters are read into R and returned as a data.frame (see Value).
consensus_cluster
Should consensus clustering be used? If TRUE, the network is clustered multiple times with transformed edge weights and the consensus of the runs is returned. Can also be a numeric vector of sigmoid shape parameters (see Details). Defaults to FALSE.
use_fast_sort
Should files be sorted using two files or in-place? If TRUE, a faster out-of-place sort is used at the cost of doubling the maximal disk consumption. Defaults to FALSE.
verbose
Should status messages (output, progress, etc.) be displayed while running? Output messages are reduced if running in non-interactive mode.
sep
Character that separates entries on a line in each file in edgelistfiles. Defaults to tab ("\t").
tempfiledir
Character vector corresponding to the location where temporary files used during execution should be stored. Defaults to R's tempdir().
Very large graphs require too much RAM for processing on some machines. In a graph containing billions of nodes and edges, loading the entire structure into RAM is rarely feasible. This implementation uses disk space for storing representations of each graph. While this is slower than computing on RAM, it allows this algorithm to scale to graphs of enormous size while only using a comparatively small amount of memory. See "Memory Consumption" for details on the total disk/memory consumption of ExoLabel.
This function expects a set of edgelist files, provided as a vector of filepaths. Each line in each file is expected to have the following format:
VERTEX1<sep>VERTEX2<sep>WEIGHT<linesep>
This line defines a single edge between vertices VERTEX1 and VERTEX2 with weight WEIGHT. VERTEX1 and VERTEX2 are strings corresponding to vertex names, and WEIGHT is a numeric value that can be interpreted as a double. The separator <sep> corresponds to the sep argument, and <linesep> is the line separator (a newline by default). The default arguments work for standard .tsv formatting, i.e., a file of three columns of tab-separated values.
If ignore_weights=TRUE, the file can instead be formatted as:
VERTEX1<sep>VERTEX2<linesep>
Note that the three-column VERTEX1<sep>VERTEX2<sep>WEIGHT format is still accepted when ignore_weights=TRUE, but the specified weights will be ignored.
Returns a list object with the parameters and result of the clustering. If using multiple clusterings, the return value is a list of lists, with each entry corresponding to the single-clustering case. This list has two entries, parameters and results.

parameters is a named vector with the values of add_self_loops used for the clustering. If more parameters are added in the future, they'll be included in this vector.

results differs depending on the value of return_table.

If return_table=TRUE, results is a data.frame object with two columns. The first column contains the name of each vertex, and the second column contains the cluster it was assigned to.

If return_table=FALSE, results is a character vector of length 1 containing the path to the file the clusters were written to. The file is formatted as a .tsv, with each line containing two tab-separated columns (vertex name, assigned cluster).
One of the main issues of Label Propagation algorithms is that they can fail to converge. Consider an unweighted directed graph with four nodes connected in a loop, i.e., A->B, B->C, C->D, D->A. If A,C are in cluster 1 and B,D are in cluster 2, the algorithm could keep processing all the nodes in a loop and never converge. To solve this issue, we introduce an additional measure for convergence controlled by iterations. If iterations=x, then we only allow the algorithm to process each node x times. Once a given node has been seen x times, it is no longer updated. This cap can be set manually, but defaults to the square root of the largest node indegree.
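The four-node cycle above can be simulated directly in base R (a minimal sketch, not the ExoLabel implementation) to see why a cap on per-node updates is needed:

```r
## Directed 4-cycle A->B->C->D->A with alternating initial clusters.
## Under plain label propagation, each node repeatedly adopts the label
## of its in-neighbor, so the labels flip on every pass and never settle.
labels <- c(A = 1, B = 2, C = 1, D = 2)
in_neighbor <- c(A = "D", B = "A", C = "B", D = "C")  # label source per node

iterations <- 5L  # cap: each node is processed at most this many times
for (i in seq_len(iterations)) {
  labels <- labels[in_neighbor]         # every node adopts its in-neighbor's label
  names(labels) <- names(in_neighbor)
}
## labels alternate between c(1,2,1,2) and c(2,1,2,1) on each pass;
## without the iteration cap this loop would never reach a fixed point
labels
```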
Additionally, ExoLabel incorporates label hop attenuation to reduce the chance of a single massive cluster dominating results. In short, as a particular label propagates to other nodes, its subsequent contribution diminishes. The farther a particular label is from its original source, the less its contribution. The degree to which its contribution diminishes scales dynamically based on the proportion of nodes that update on each cycle. For more information, see Leung et al. (2009) linked in the References section.
A large portion of the processing time is reading in the graph object. This leads to a lot of duplicated effort when trying to cluster the same network with multiple parameters.
Multiple clusterings on the same network are supported by passing vectors of inputs to outfile and add_self_loops. If the length of outfile is greater than 1, add_self_loops can be set to either a single value or a vector of the same length as outfile. For a single value, the same parameter value is used across all clusterings; for multiple values, the corresponding value is used in each clustering. See "Examples" for example usage.
Note that the order in which nodes are processed is randomly initialized, so multiple runs with the same parameters may differ due to this initialization step.
Consensus clustering can be enabled by setting consensus_cluster=TRUE. Consensus clustering runs ExoLabel on the input graph multiple times, transforming weight values according to a sigmoid function. By default, this runs nine times for sigmoids with scale 0.5 and shape c(0,0.2,0.4,0.6,0.8,1.0,1.33,1.67,2.0), collapsing weights below 0.1 to zero. The resulting clusters form a network such that the edge weight between any two nodes connected in the initial graph is the proportion of clusterings in which they shared a cluster. This network is used for a final label propagation run, which identifies the consensus clusters. Users can instead supply a numeric vector to consensus_cluster, which overrides the default shape parameters and number of runs.
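The consensus step can be sketched in base R. The cluster assignments below are hypothetical; the only rule taken from the description above is that a consensus edge weight is the proportion of runs in which the two endpoints shared a cluster:

```r
## Hypothetical cluster assignments from 3 clustering runs over 4 nodes.
## Each row is one run; each column is one node's assigned cluster.
runs <- rbind(c(1, 1, 2, 2),   # run 1
              c(1, 1, 1, 2),   # run 2
              c(1, 2, 2, 2))   # run 3
colnames(runs) <- c("A", "B", "C", "D")

## Consensus weight of an edge = fraction of runs where both endpoints
## landed in the same cluster
consensus_weight <- function(u, v) mean(runs[, u] == runs[, v])

consensus_weight("A", "B")  # same cluster in 2 of 3 runs
consensus_weight("A", "D")  # never share a cluster
```

Edges with high consensus weight are consistently grouped across runs, which is what the final label propagation pass exploits.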
While this algorithm can scale to very large graphs, it does have some internal limitations. First, node names must comprise no more than 254 characters. If this limitation is restrictive, please feel free to contact me. Alternatively, you can increase the limit yourself by changing the definition of MAX_NODE_NAME_SIZE in src/OnDiskLP.c. This limit is in place to decrease memory overhead and improve runtime, but arbitrary values are possible.

Second, nodes are indexed using 54-bit unsigned integers, so the maximum possible number of nodes is 2^54-1, which is about 1.8 quadrillion. As with the character limitation, feel free to contact me if this is too restrictive. Alternatively, you can decrease the value of BITS_FOR_WEIGHT in src/OnDiskLP.c, but note that this value determines how many bits are used to represent weights internally, so lower values will introduce more error.

Third, this algorithm uses disk space to store large objects. As such, please ensure you have sufficient disk space for the graph you intend to process. While there are safeguards in the code itself, unhandleable errors can occur when the OS runs out of space. Use EstimateExoLabel to estimate the disk consumption of your graph, and see "Memory Consumption" for more details on how the total disk/memory consumption is calculated. Note that using use_fast_sort=TRUE will double the maximal disk consumption of the algorithm.
Let v be the number of unique nodes, d the average indegree of nodes, and l the average length of node labels. Note that the number of edges e is equivalent to dv.

Specific calculations for memory/disk consumption are detailed below. In summary, the absolute worst-case memory consumption is roughly (24l+18)v bytes, and the maximum disk consumption during computation is 16dv bytes (or 32dv bytes if use_fast_sort=TRUE). The final table returned consumes (2+l+log10(v))v bytes.
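These summary formulas can be wrapped in a small helper. This is a rough back-of-the-envelope sketch under the worst-case assumptions above, not a substitute for EstimateExoLabel:

```r
## Worst-case resource estimates from the summary formulas:
##   RAM   ~ (24*l + 18)*v bytes
##   disk  ~ 16*d*v bytes (32*d*v with use_fast_sort=TRUE)
##   table ~ (2 + l + log10(v))*v bytes
estimate_exolabel_resources <- function(v, d, l, use_fast_sort = FALSE) {
  c(ram_bytes   = (24 * l + 18) * v,
    disk_bytes  = (if (use_fast_sort) 32 else 16) * d * v,
    table_bytes = (2 + l + log10(v)) * v)
}

## e.g. 1 million nodes, average indegree 10, 20-character labels:
estimate_exolabel_resources(1e6, 10, 20)
```

For this example the worst-case RAM is about 498MB, intermediate disk about 160MB, and the final table about 28MB.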
ExoLabel builds a trie to keep track of vertex names. Each internal node of the trie consumes 24 bytes, and each leaf node consumes 18 bytes. The lowest possible RAM consumption of the trie (if every label has length l and shares the same prefix of length l-1) is roughly 40v bytes, and the maximum RAM consumption (if no two node labels have any prefix in common) is (24l+18)v bytes. We can generalize this to estimate the total memory consumption as roughly (24(l-p)+18)v bytes, where p is the average length of the common prefix between any two node labels.
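As a quick sanity check of the generalized formula (a sketch; setting p = 0 recovers the worst case with no shared prefixes):

```r
## Estimated trie RAM: (24*(l - p) + 18)*v bytes, where p is the average
## shared-prefix length between node labels
trie_ram_bytes <- function(v, l, p = 0) (24 * (l - p) + 18) * v

trie_ram_bytes(1e6, 20)       # worst case: no shared prefixes
trie_ram_bytes(1e6, 20, 15)   # heavy prefix sharing shrinks the trie
```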
ExoLabel also uses a number of internal caches to speed up reads/writes from files. These caches take around 200MB of RAM in total. Note that this calculation does not include the RAM required for R itself, which is on the order of 300MB on my machine. It also uses an internal queue for processing nodes, which consumes roughly 10v bytes, and an internal index of size 8v bytes.
As for disk space, ExoLabel transforms the graph into a CSR-compressed network, which is split across three files: a header, a neighbors list, and a weights list. The header file contains v+1 entries of 8 bytes, and the other two files consume a total of 12 bytes per outgoing edge. The number of edges to record is vd. Thus, the total disk consumption in bytes is 8(v+1) + 12vd, or approximately 8v + 12dv. However, the initial reading of the edges requires 16 bytes per edge, resulting in a maximum disk consumption of 16dv (since d > 2 for most graphs). If use_fast_sort=TRUE, this edge reading maximally consumes 32 bytes per edge (a maximum disk consumption of 32dv).
The final table returned is a tab-separated table containing vertex names and cluster numbers in human-readable format. Each line consumes at most l + 2 + log10(v) bytes. In the worst case, the number of clusters equals the number of vertices, so cluster numbers have up to log10(v) digits; the average number of digits is close to that of the largest number due to how digit counts scale. The extra two bytes are for the tab and newline characters. Thus, the total size of the file is at most (2+l+log10(v))v bytes. We remove all intermediate files prior to outputting clusters, so in practical cases this should be smaller than the intermediate disk consumption.
Aidan Lakshman <AHL27@pitt.edu>
Traag, V.A., and L. Subelj. Large network community detection by fast label propagation. Sci. Rep., 2023. 13(2701). https://doi.org/10.1038/s41598-023-29610-z
Leung, X.Y.I., et al., Towards real-time community detection in large networks. Phys. Rev. E, 2009. 79(066107). https://doi.org/10.1103/PhysRevE.79.066107
EstimateExoLabel
num_verts <- 20L
num_edges <- 20L
all_verts <- sample(letters, num_verts)
all_edges <- vapply(seq_len(num_edges),
\(i) paste(c(sample(all_verts, 2L),
as.character(round(runif(1),3))),
collapse='\t'),
character(1L))
edgefile <- tempfile()
if(file.exists(edgefile)) file.remove(edgefile)
writeLines(all_edges, edgefile)
res <- ExoLabel(edgefile, return_table=TRUE)
print(res)
###########################
### Multiple Clustering ###
###########################
## Run with multiple add_self_loops values
tfs <- replicate(3, tempfile())
p2 <- ExoLabel(edgefile, tfs,
add_self_loops=c(0,0.5,1),
return_table = TRUE)