DocumentCode :
734234
Title :
A Scalable Hierarchical Clustering Algorithm Using Spark
Author :
Chen Jin ; Ruoqian Liu ; Hendrix, William ; Agrawal, Ankit ; Choudhary, Alok ; Zhengzhang Chen
Author_Institution :
Northwestern Univ., Evanston, IL, USA
fYear :
2015
fDate :
March 30 2015-April 2 2015
Firstpage :
418
Lastpage :
426
Abstract :
Clustering is often an essential first step in data mining, intended to reduce redundancy or define data categories. Hierarchical clustering, a widely used clustering technique, can offer a richer representation by suggesting potential group structures. However, parallelizing such an algorithm is challenging because it exhibits inherent data dependency during hierarchical tree construction. In this paper, we design a parallel implementation of Single-linkage Hierarchical Clustering by formulating it as a Minimum Spanning Tree problem. We further show that Spark is a natural fit for parallelizing the single-linkage clustering algorithm because it expresses iterative processes naturally. Our algorithm can be deployed easily in Amazon's cloud environment, and a thorough performance evaluation on Amazon's EC2 verifies that the algorithm's scalability is sustained as the datasets scale up.
Keywords :
cloud computing; data mining; data reduction; iterative methods; parallel processing; pattern clustering; trees (mathematics); Amazon EC2; Amazon cloud environment; Spark; data category; data dependency; data mining; hierarchical tree construction; iterative process; minimum spanning tree problem; performance evaluation; potential group structures; redundancy reduction; scalable hierarchical clustering algorithm; single-linkage hierarchical clustering algorithm; Algorithm design and analysis; Clustering algorithms; Data mining; Data models; Partitioning algorithms; Scalability; Sparks; Hierarchical Clustering; Minimum Spanning Tree; Spark;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Big Data Computing Service and Applications (BigDataService), 2015 IEEE First International Conference on
Conference_Location :
Redwood City, CA
Type :
conf
DOI :
10.1109/BigDataService.2015.67
Filename :
7184911