Title :
Scaling clustering algorithms for massive data sets using data streams
Author :
Nittel, Silvia ; Leung, Kelvin T. ; Braverman, Amy
Author_Institution :
SIE, Maine Univ., Orono, ME, USA
Date :
30 March-2 April 2004
Abstract :
Computing clustering techniques on massive data sets is still neither feasible nor efficient today. For many scientific applications, however, raw data such as satellite imagery can be replaced with compressed counterparts. To facilitate scientific data analysis, the reduced data set must preserve the high-order correlations between the attributes in the data set as well as their nonparametric distribution. Practical data reduction can therefore be achieved by partitioning the overall data set with a coarse regular spatial grid and compressing each grid cell individually, by computing either multivariate histograms or a k-means clustering. Clustering spatial data in high-dimensional spaces using k-means is expensive with regard to both computational cost and memory requirements. In a traditional k-means implementation, all N data points belonging to a grid cell must be kept in memory so that they can be clustered at once, which often creates a bottleneck for scientific data sets. Our objective is to define a clustering algorithm that scales automatically to any number of data points in a single grid cell and provides high-quality clustering results.
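The memory bottleneck described above can be illustrated with a small sketch. The abstract does not spell out the algorithm itself, so the code below is only one plausible streaming variant, not necessarily the authors' method: the points of a single grid cell arrive as a stream, each fixed-size chunk is clustered with weighted k-means, and the resulting weighted centroids are clustered once more at the end. The names `kmeans`, `stream_kmeans`, and `chunk_size` are hypothetical and chosen for illustration.

```python
import random


def kmeans(points, weights, k, iters=20, seed=0):
    """Weighted Lloyd's k-means on a list of equal-length tuples.

    Returns (centroids, masses); each mass is the total weight of the
    points assigned to that centroid. Empty clusters are dropped.
    """
    rng = random.Random(seed)
    k = min(k, len(points))
    dim = len(points[0])
    centroids = [list(p) for p in rng.sample(points, k)]
    mass = [0.0] * k
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        mass = [0.0] * k
        for p, w in zip(points, weights):
            # Assign the point to its nearest centroid (squared Euclidean).
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            mass[j] += w
            for d in range(dim):
                sums[j][d] += w * p[d]
        for j in range(k):
            if mass[j] > 0:
                centroids[j] = [s / mass[j] for s in sums[j]]
    kept = [(tuple(c), m) for c, m in zip(centroids, mass) if m > 0]
    return [c for c, _ in kept], [m for _, m in kept]


def stream_kmeans(stream, k, chunk_size=1000):
    """Cluster a stream of points while holding only one chunk in memory.

    Assumption (not from the abstract): each chunk is reduced to at most
    k weighted centroids, and those summaries are merged by a final
    weighted k-means run.
    """
    summary_pts, summary_w = [], []
    chunk = []

    def flush():
        if chunk:
            cents, masses = kmeans(chunk, [1.0] * len(chunk), k)
            summary_pts.extend(cents)
            summary_w.extend(masses)
            chunk.clear()

    for p in stream:
        chunk.append(tuple(p))
        if len(chunk) >= chunk_size:
            flush()
    flush()  # cluster the last, possibly smaller, chunk
    if not summary_pts:
        return [], []
    return kmeans(summary_pts, summary_w, k)
```

In this sketch the working set is one chunk plus at most k summary centroids per chunk processed, rather than all N points of the cell. The trade-off is that the final result approximates a full k-means run, since each chunk is summarized before the merge.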
Keywords :
data analysis; data reduction; statistical analysis; visual databases; data stream; k-means clustering; massive data sets; multivariate histogram; satellite imagery data; scientific data analysis; spatial data clustering; spatial data grid; Clustering algorithms; Computer science; Data analysis; Geoscience; Grid computing; Image coding; Kelvin; Laboratories; Propulsion; Satellites
Conference_Title :
Data Engineering, 2004. Proceedings. 20th International Conference on
Print_ISBN :
0-7695-2065-0
DOI :
10.1109/ICDE.2004.1320061