ExoLabel: R Documentation
Runs Fast Label Propagation using disk space for constant memory complexity.
ExoLabel(edgelistfiles,
outfile=tempfile(),
mode=c("undirected", "directed"),
add_self_loops=FALSE,
ignore_weights=FALSE,
iterations=0L,
return_table=FALSE,
consensus_cluster=FALSE,
use_fast_sort=FALSE,
verbose=interactive(),
sep='\t',
tempfiledir=tempdir())
edgelistfiles
Character vector of files to be processed. Each entry should be a machine-interpretable path to an edgelist file. See Details for the expected format.
outfile
File to write final clusters to. Optional, defaults to a temporary file. Can be set to a vector of filepaths to run multiple clusterings (see Details).
mode
String specifying whether edges should be interpreted as undirected (default) or directed. Can be "undirected", "directed", or an unambiguous abbreviation.
add_self_loops
Should self loops be added to the network? If TRUE, self loops are added with weight 1.0; a numeric value adds self loops with that weight instead. Defaults to FALSE.
ignore_weights
Should weights be ignored? If TRUE, all edges are treated as if they had weight 1. Must be TRUE if the input files contain no weight column. Defaults to FALSE.
iterations
Maximum number of times to process each node. If set to zero or less, defaults to the square root of the largest node indegree (see Details).
return_table
Should the result of clustering be returned as a file or as a data.frame object? If FALSE (the default), returns the path to the file the clusters were written to; if TRUE, the clusters are read into R and returned as a data.frame (see Value).
consensus_cluster
Should consensus clustering be used? If TRUE, the network is clustered multiple times with transformed edge weights and the consensus of the runs is returned. Can also be a numeric vector of sigmoid shape parameters (see Details). Defaults to FALSE.
use_fast_sort
Should files be sorted using two files or in-place? If TRUE, a faster out-of-place sort is used at the cost of doubling the maximal disk consumption. Defaults to FALSE.
verbose
Should status messages (output, progress, etc.) be displayed while running? Output messages are reduced if running in non-interactive mode.
sep
Character that separates entries on a line in each file in edgelistfiles. Defaults to tab ("\t").
tempfiledir
Character vector corresponding to the location where temporary files used during execution should be stored. Defaults to R's tempdir().
Very large graphs require too much RAM for processing on some machines. In a graph containing billions of nodes and edges, loading the entire structure into RAM is rarely feasible. This implementation uses disk space for storing representations of each graph. While this is slower than computing on RAM, it allows this algorithm to scale to graphs of enormous size while only using a comparatively small amount of memory. See "Memory Consumption" for details on the total disk/memory consumption of ExoLabel.
This function expects a set of edgelist files, provided as a vector of filepaths. Each line in each file is expected to have the following format:
VERTEX1<sep>VERTEX2<sep>WEIGHT<linesep>
This line defines a single edge between vertices VERTEX1 and VERTEX2 with weight WEIGHT. VERTEX1 and VERTEX2 are strings corresponding to vertex names, and WEIGHT is a numeric value that can be interpreted as a double. The separator <sep> corresponds to the sep argument, and <linesep> is the line separator (a newline by default). The default arguments work for standard .tsv formatting, i.e., a file of three columns of tab-separated values.
If ignore_weights=TRUE, the file can instead be formatted as:
VERTEX1<sep>VERTEX2<linesep>
Note that the three-column VERTEX1<sep>VERTEX2<sep>WEIGHT format is still accepted when ignore_weights=TRUE, but the specified weights will be ignored.
Returns a list object with the parameters and result of the clustering. If using multiple clusterings, the return value is a list of lists, with each entry corresponding to the single-clustering case. This list has two entries, parameters and results.

parameters is a named vector with the values of add_self_loops used for the clustering. If more parameters are added in the future, they'll be included in this vector.

results differs depending on the value of return_table.

If return_table=TRUE, results is a data.frame object with two columns. The first column contains the name of each vertex, and the second column contains the cluster it was assigned to.

If return_table=FALSE, results is a character vector of length 1 containing the path to the file the clusters were written to. The file is formatted as a .tsv, with each line containing two tab-separated columns (vertex name, assigned cluster).
One of the main issues of Label Propagation algorithms is that they can fail to converge. Consider an unweighted directed graph with four nodes connected in a loop, i.e., A->B, B->C, C->D, D->A. If A,C are in cluster 1 and B,D are in cluster 2, the algorithm could keep processing all the nodes in a loop and never converge. To solve this issue, we introduce an additional measure for convergence controlled by iterations. If iterations=x, then we only allow the algorithm to process each node x times. Once a given node has been seen x times, it is no longer updated. This cap can be set manually, but defaults to the square root of the largest node indegree.
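The four-node cycle above can be simulated directly in base R (a minimal sketch, not the ExoLabel implementation) to see why a cap on per-node updates is needed:

```r
## Directed 4-cycle A->B->C->D->A with alternating initial clusters.
## Under plain label propagation, each node repeatedly adopts the label
## of its in-neighbor, so the labels flip on every pass and never settle.
labels <- c(A = 1, B = 2, C = 1, D = 2)
in_neighbor <- c(A = "D", B = "A", C = "B", D = "C")  # label source per node

iterations <- 5L  # cap: each node is processed at most this many times
for (i in seq_len(iterations)) {
  labels <- labels[in_neighbor]         # every node adopts its in-neighbor's label
  names(labels) <- names(in_neighbor)
}
## labels alternate between c(1,2,1,2) and c(2,1,2,1) on each pass;
## without the iteration cap this loop would never reach a fixed point
labels
```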
Additionally, ExoLabel incorporates label hop attenuation to reduce the chance of a single massive cluster dominating results. In short, as a particular label propagates to other nodes, its subsequent contribution diminishes. The farther a particular label is from its original source, the less its contribution. The degree to which its contribution diminishes scales dynamically based on the proportion of nodes that update on each cycle. For more information, see Leung et al. (2009) linked in the References section.
A large portion of the processing time is reading in the graph object. This leads to a lot of duplicated effort when trying to cluster the same network with multiple parameters.
Multiple clusterings on the same network are supported by passing vectors of inputs to outfile and add_self_loops. If the length of outfile is greater than 1, add_self_loops can be set to either a single value or a vector of the same length as outfile. For a single value, the same parameter value is used across all clusterings; for multiple values, the corresponding value is used in each clustering. See "Examples" for example usage.
Note that the order in which nodes are processed is randomly initialized, so multiple runs with the same parameters may differ due to this initialization step.
Consensus clustering can be enabled by setting consensus_cluster=TRUE. Consensus clustering runs ExoLabel on the input graph multiple times, transforming weight values according to a sigmoid function. By default, this runs nine times for sigmoids with scale 0.5 and shape c(0,0.2,0.4,0.6,0.8,1.0,1.33,1.67,2.0), collapsing weights below 0.1 to zero. The resulting clusters form a network such that the edge weight between any two nodes connected in the initial graph is the proportion of clusterings in which they shared a cluster. This network is used for a final label propagation run, which identifies the consensus clusters. Users can instead supply a numeric vector to consensus_cluster, which overrides the default shape parameters and number of runs.
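The consensus step can be sketched in base R. The cluster assignments below are hypothetical; the only rule taken from the description above is that a consensus edge weight is the proportion of runs in which the two endpoints shared a cluster:

```r
## Hypothetical cluster assignments from 3 clustering runs over 4 nodes.
## Each row is one run; each column is one node's assigned cluster.
runs <- rbind(c(1, 1, 2, 2),   # run 1
              c(1, 1, 1, 2),   # run 2
              c(1, 2, 2, 2))   # run 3
colnames(runs) <- c("A", "B", "C", "D")

## Consensus weight of an edge = fraction of runs where both endpoints
## landed in the same cluster
consensus_weight <- function(u, v) mean(runs[, u] == runs[, v])

consensus_weight("A", "B")  # same cluster in 2 of 3 runs
consensus_weight("A", "D")  # never share a cluster
```

Edges with high consensus weight are consistently grouped across runs, which is what the final label propagation pass exploits.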
While this algorithm can scale to very large graphs, it does have some internal limitations. First, node names must comprise no more than 254 characters. If this limitation is restrictive, please feel free to contact me. Alternatively, you can increase the limit yourself by changing the definition of MAX_NODE_NAME_SIZE in src/OnDiskLP.c. This limit is in place to decrease memory overhead and improve runtime, but arbitrary values are possible.

Second, nodes are indexed using 54-bit unsigned integers, so the maximum possible number of nodes is 2^54-1, which is about 1.8 quadrillion. As with the character limitation, feel free to contact me if this is too restrictive. Alternatively, you can decrease the value of BITS_FOR_WEIGHT in src/OnDiskLP.c, but note that this value determines how many bits are used to represent weights internally, so lower values will introduce more error.

Third, this algorithm uses disk space to store large objects. As such, please ensure you have sufficient disk space for the graph you intend to process. While there are safeguards in the code itself, unhandleable errors can occur when the OS runs out of space. Use EstimateExoLabel to estimate the disk consumption of your graph, and see "Memory Consumption" for more details on how the total disk/memory consumption is calculated. Note that using use_fast_sort=TRUE will double the maximal disk consumption of the algorithm.
Let v be the number of unique nodes, d the average indegree of nodes, and l the average length of node labels. Note that the number of edges e is equivalent to dv.

Specific calculations for memory/disk consumption are detailed below. In summary, the absolute worst-case memory consumption is roughly (24l+18)v bytes, and the maximum disk consumption during computation is 16dv bytes (or 32dv bytes if use_fast_sort=TRUE). The final table returned consumes (2+l+log10(v))v bytes.
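These summary formulas can be wrapped in a small helper. This is a rough back-of-the-envelope sketch under the worst-case assumptions above, not a substitute for EstimateExoLabel:

```r
## Worst-case resource estimates from the summary formulas:
##   RAM   ~ (24*l + 18)*v bytes
##   disk  ~ 16*d*v bytes (32*d*v with use_fast_sort=TRUE)
##   table ~ (2 + l + log10(v))*v bytes
estimate_exolabel_resources <- function(v, d, l, use_fast_sort = FALSE) {
  c(ram_bytes   = (24 * l + 18) * v,
    disk_bytes  = (if (use_fast_sort) 32 else 16) * d * v,
    table_bytes = (2 + l + log10(v)) * v)
}

## e.g. 1 million nodes, average indegree 10, 20-character labels:
estimate_exolabel_resources(1e6, 10, 20)
```

For this example the worst-case RAM is about 498MB, intermediate disk about 160MB, and the final table about 28MB.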
ExoLabel builds a trie to keep track of vertex names. Each internal node of the trie consumes 24 bytes, and each leaf node consumes 18 bytes. The lowest possible RAM consumption of the trie (if every label has length l and shares the same prefix of length l-1) is roughly 40v bytes, and the maximum RAM consumption (if no two node labels have any prefix in common) is (24l+18)v bytes. We can generalize this to estimate the total memory consumption as roughly (24(l-p)+18)v bytes, where p is the average length of the common prefix between any two node labels.
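As a quick sanity check of the generalized formula (a sketch; setting p = 0 recovers the worst case with no shared prefixes):

```r
## Estimated trie RAM: (24*(l - p) + 18)*v bytes, where p is the average
## shared-prefix length between node labels
trie_ram_bytes <- function(v, l, p = 0) (24 * (l - p) + 18) * v

trie_ram_bytes(1e6, 20)       # worst case: no shared prefixes
trie_ram_bytes(1e6, 20, 15)   # heavy prefix sharing shrinks the trie
```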
ExoLabel also uses a number of internal caches to speed up reads/writes from files. These caches take around 200MB of RAM in total. Note that this calculation does not include the RAM required for R itself, which is on the order of 300MB on my machine. It also uses an internal queue for processing nodes, which consumes roughly 10v bytes, and an internal index of size 8v bytes.
As for disk space, ExoLabel transforms the graph into a CSR-compressed network, which is split across three files: a header, a neighbors list, and a weights list. The header file contains v+1 entries of 8 bytes, and the other two files consume a total of 12 bytes per outgoing edge. The number of edges to record is vd. Thus, the total disk consumption in bytes is 8(v+1) + 12vd, or approximately 8v + 12dv. However, the initial reading of the edges requires 16 bytes per edge, resulting in a maximum disk consumption of 16dv (since d > 2 for most graphs). If use_fast_sort=TRUE, this edge reading maximally consumes 32 bytes per edge (a maximum disk consumption of 32dv).
The final table returned is a tab-separated table containing vertex names and cluster numbers in human-readable format. Each line consumes at most l + 2 + log10(v) bytes. In the worst case, the number of clusters equals the number of vertices, so cluster numbers have up to log10(v) digits; the average number of digits is close to that of the largest number due to how digit counts scale. The extra two bytes are for the tab and newline characters. Thus, the total size of the file is at most (2+l+log10(v))v bytes. We remove all intermediate files prior to outputting clusters, so in practical cases this should be smaller than the intermediate disk consumption.
Aidan Lakshman <AHL27@pitt.edu>
Traag, V.A., and L. Subelj. Large network community detection by fast label propagation. Sci. Rep., 2023. 13(2701). https://doi.org/10.1038/s41598-023-29610-z
Leung, X.Y.I., et al., Towards real-time community detection in large networks. Phys. Rev. E, 2009. 79(066107). https://doi.org/10.1103/PhysRevE.79.066107
EstimateExoLabel
num_verts <- 20L
num_edges <- 20L
all_verts <- sample(letters, num_verts)
all_edges <- vapply(seq_len(num_edges),
\(i) paste(c(sample(all_verts, 2L),
as.character(round(runif(1),3))),
collapse='\t'),
character(1L))
edgefile <- tempfile()
if(file.exists(edgefile)) file.remove(edgefile)
writeLines(all_edges, edgefile)
res <- ExoLabel(edgefile, return_table=TRUE)
print(res)
###########################
### Multiple Clustering ###
###########################
## Run with multiple add_self_loops values
tfs <- replicate(3, tempfile())
p2 <- ExoLabel(edgefile, tfs,
add_self_loops=c(0,0.5,1),
return_table = TRUE)