Finding connected components in map-reduce in logarithmic rounds

Author

Rastogi, V. ; Machanavajjhala, A. ; Chitnis, L. ; Das Sarma, Akash

Author_Institution

Google, Mountain View, CA, USA

fYear

2013

fDate

8-12 April 2013

Firstpage

50

Lastpage

61

Abstract

Given a large graph G = (V, E) with millions of nodes and edges, how do we compute its connected components efficiently? Recent work addresses this problem in map-reduce, where a fundamental trade-off exists between the number of map-reduce rounds and the communication of each round. Denoting by d the diameter of the graph and by n the number of nodes in the largest component, all prior techniques for map-reduce require either a linear, Θ(d), number of rounds or a quadratic, Θ(n|V| + |E|), communication per round. We propose here two efficient map-reduce algorithms: (i) Hash-Greater-to-Min, a randomized algorithm based on PRAM techniques, requiring O(log n) rounds and O(|V| + |E|) communication per round, and (ii) Hash-to-Min, a novel algorithm that provably finishes in O(log n) iterations for path graphs. The proof technique used for Hash-to-Min is novel, but not tight, and the algorithm is actually faster than Hash-Greater-to-Min in practice. We conjecture that it requires 2 log d rounds and 3(|V| + |E|) communication per round, as demonstrated in our experiments. Using secondary sorting, a standard map-reduce feature, we scale Hash-to-Min to graphs with very large connected components. Our techniques for connected components can be applied to clustering as well. We propose a novel algorithm for agglomerative single linkage clustering in map-reduce. This is the first map-reduce algorithm for clustering in at most O(log n) rounds, where n is the size of the largest cluster. We show the effectiveness of all our algorithms through detailed experiments on large synthetic as well as real-world datasets.
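
The abstract names Hash-to-Min without describing its steps. The following is a minimal single-machine sketch of how it is commonly described, not the authors' implementation: each node keeps a cluster, sends the whole cluster to the cluster's minimum node, and sends that minimum to every other member. The function name `hash_to_min`, the `max_rounds` cap, and the in-memory simulation of the map and reduce phases are all assumptions made for illustration.

```python
from collections import defaultdict


def hash_to_min(edges, nodes=None, max_rounds=100):
    """Simulate Hash-to-Min rounds in memory (illustrative sketch only).

    Returns (labels, rounds), where labels maps each node to the smallest
    node id in its connected component.
    """
    # Initial state: each node's cluster is itself plus its neighbours.
    cluster = defaultdict(set)
    for u, v in edges:
        cluster[u].update((u, v))
        cluster[v].update((u, v))
    for v in nodes or ():  # isolated nodes form singleton clusters
        cluster[v].add(v)

    rounds = 0
    while rounds < max_rounds:
        # "Map" phase: emit the cluster to its minimum node, and emit the
        # minimum node to every member of the cluster.
        # "Reduce" phase: each receiver takes the union of what it is sent.
        received = defaultdict(set)
        for v, members in cluster.items():
            v_min = min(members)
            received[v_min] |= members
            for u in members:
                received[u].add(v_min)

        if received == cluster:  # fixed point: nothing changed this round
            break
        cluster = received
        rounds += 1

    # At the fixed point, every cluster's minimum is the component minimum.
    labels = {v: min(members) for v, members in cluster.items()}
    return labels, rounds


if __name__ == "__main__":
    labels, rounds = hash_to_min([(1, 2), (2, 3), (4, 5)], nodes=[6])
    print(labels, rounds)
```

In this sketch the per-round communication corresponds to the total size of the sets placed in `received`. If `max_rounds` is reached before the fixed point, the returned labels may not yet be final.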

Keywords

computational complexity; file organisation; graph theory; pattern clustering; randomised algorithms; sorting; PRAM technique; agglomerative single linkage clustering; connected component; graph diameter; hash-greater-to-min; logarithmic rounds; map-reduce algorithm; map-reduce rounds; proof technique; randomized algorithm; real-world dataset; secondary sorting; Clustering algorithms; Complexity theory; Convergence; Couplings; Merging; Phase change random access memory; Vegetation;

fLanguage

English

Publisher

IEEE

Conference_Titel

Data Engineering (ICDE), 2013 IEEE 29th International Conference on

Conference_Location

Brisbane, QLD

ISSN

1063-6382

Print_ISBN

978-1-4673-4909-3

Electronic_ISBN

1063-6382

Type

conf

DOI

10.1109/ICDE.2013.6544813

Filename

6544813