Title :
Scalable single linkage hierarchical clustering for big data
Author :
Havens, Timothy C. ; Bezdek, James C. ; Palaniswami, Marimuthu
Author_Institution :
Electr. & Comput. Eng. Dept., Michigan Technol. Univ., Houghton, MI, USA
Abstract :
Personal computing technologies are everywhere; hence, there is an abundance of staggeringly large data sets: the Library of Congress has stored over 160 terabytes of web data, and it is estimated that Facebook alone logs nearly a petabyte of data per day. Thus, there is a pressing need for systems that can elucidate the similarity and dissimilarity among and between groups in these big data sets. Clustering is one way to find these groups. In this paper, we extend the scalable Visual Assessment of Tendency (sVAT) algorithm to return single-linkage partitions of big data sets. The sVAT algorithm is designed to provide visual evidence of the number of clusters in unloadable (big) data sets. The extension we describe enables sVAT to also efficiently return the data partition indicated by that visual evidence. The computational complexity and storage requirements of sVAT are (usually) significantly less than the O(n^2) requirement of the classic single-linkage hierarchical algorithm. We show that sVAT is a scalable instantiation of single-linkage clustering for data sets that contain c compact-separated clusters, where c ≪ n and n is the number of objects. For data sets that do not contain compact-separated clusters, we show that sVAT produces a good approximation of single-linkage partitions. Experimental results are presented for both synthetic and real data sets.
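For orientation, the classic single-linkage baseline that the abstract compares against can be sketched as follows. This is not the sVAT algorithm and not the authors' code; it is a minimal illustration that relies on the standard fact that cutting the c-1 longest edges of a minimum spanning tree yields the c-cluster single-linkage partition. The function name `single_linkage_partition`, its parameters, and the use of Euclidean distance are assumptions made for this example. Every pairwise distance is evaluated, which is the O(n^2) cost the abstract refers to.

```python
import math


def single_linkage_partition(points, c):
    """Label points by cutting the c-1 longest edges of a Euclidean MST.

    Hypothetical helper for illustration only; not part of sVAT.
    """
    n = len(points)
    if not 1 <= c <= n:
        raise ValueError("c must be between 1 and the number of points")

    # Prim's algorithm on the complete graph: O(n^2) distance evaluations.
    in_tree = [False] * n
    best_dist = [math.inf] * n   # shortest known distance to the tree
    best_from = [-1] * n         # tree vertex that achieves best_dist
    in_tree[0] = True
    for j in range(1, n):
        best_dist[j] = math.dist(points[0], points[j])
        best_from[j] = 0

    edges = []
    for _ in range(n - 1):
        # Add the non-tree vertex closest to the current tree.
        j = min((k for k in range(n) if not in_tree[k]),
                key=best_dist.__getitem__)
        edges.append((best_dist[j], best_from[j], j))
        in_tree[j] = True
        for k in range(n):
            if not in_tree[k]:
                d = math.dist(points[j], points[k])
                if d < best_dist[k]:
                    best_dist[k] = d
                    best_from[k] = j

    # Keeping the n-c shortest MST edges is the same as cutting the c-1 longest.
    edges.sort()
    kept = edges[: n - c]

    # Union-find over the kept edges gives the connected components.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for _, a, b in kept:
        parent[find(a)] = find(b)

    # Relabel components as 0, 1, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]
```

Called on two well-separated groups of points with c = 2, the sketch assigns one label per group. Because it touches all n(n-1)/2 pairs, it is only practical for data that fits in memory, which is the limitation a scalable method such as sVAT is meant to address.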
Keywords :
computational complexity; pattern clustering; personal computing; storage management; Facebook; Library of Congress; Web data; big data sets; compact-separated clusters; computational complexity; large data sets; personal computing technologies; sVAT algorithm; scalable single linkage hierarchical clustering; scalable visual assessment of tendency algorithm; single-linkage partitions; storage requirements; Algorithm design and analysis; Approximation algorithms; Big data; Clustering algorithms; Partitioning algorithms; Vectors; Visualization;
Conference_Title :
Intelligent Sensors, Sensor Networks and Information Processing, 2013 IEEE Eighth International Conference on
Conference_Location :
Melbourne, VIC
Print_ISBN :
978-1-4673-5499-8
DOI :
10.1109/ISSNIP.2013.6529823