Scaling clustering algorithms for massive data sets using data streams

Author

Nittel, Silvia ; Leung, Kelvin T. ; Braverman, Amy

Author_Institution

SIE, Maine Univ., Orono, ME, USA

fYear

2004

fDate

30 March-2 April 2004

Firstpage

830

Abstract

Running clustering techniques on massive data sets is still neither feasible nor efficient today. One remedy is data reduction: for instance, raw satellite imagery data can be replaced with compressed counterparts for many scientific applications. However, to facilitate scientific data analysis, the high-order correlations between the attributes in the data set, as well as their nonparametric distribution, must be preserved in the reduced data set. Therefore, practical data reduction can be achieved by partitioning the overall data set via a coarse regular spatial grid and compressing each grid cell individually by computing multivariate histograms or k-means clustering. Clustering spatial data in high-dimensional spaces using k-means is expensive in both computational cost and memory requirements. In a traditional k-means implementation, all N data points belonging to a grid cell must be held in memory at once to be clustered, which often creates a bottleneck for scientific data sets. Our objective is to define a clustering algorithm that scales automatically to any number of data points in a single grid cell and provides high-quality clustering results.
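
The memory bottleneck described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm, which the abstract does not specify; it shows one generic way to cluster a grid cell without holding all N points in memory: each chunk of the stream is reduced to k weighted centroids, and the centroids are then clustered again. The names `grid_cell`, `weighted_kmeans`, and `stream_kmeans`, the chunking scheme, and the fixed iteration count are all assumptions made for illustration.

```python
import math
import random


def grid_cell(lat, lon, cell_deg):
    """Map a coordinate to the index of its coarse regular grid cell (hypothetical helper)."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def sq_dist(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_kmeans(points, weights, k, iters=20, seed=0):
    """Lloyd's algorithm on weighted points; returns (centroids, cluster weights)."""
    rng = random.Random(seed)
    k = min(k, len(points))
    centroids = rng.sample(list(points), k)
    totals = [0.0] * k
    for _ in range(iters):
        dims = len(points[0])
        sums = [[0.0] * dims for _ in range(k)]
        totals = [0.0] * k
        for p, w in zip(points, weights):
            # Assign each point to its nearest current centroid.
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            totals[j] += w
            for d, x in enumerate(p):
                sums[j][d] += w * x
        # Move each centroid to the weighted mean of its points;
        # an empty cluster keeps its previous position (with weight 0).
        centroids = [
            tuple(s / totals[j] for s in sums[j]) if totals[j] > 0 else centroids[j]
            for j in range(k)
        ]
    return centroids, totals


def stream_kmeans(chunks, k, seed=0):
    """Cluster a stream chunk by chunk, keeping only k weighted centroids per chunk."""
    summary_pts, summary_wts = [], []
    for chunk in chunks:
        chunk = list(chunk)
        if not chunk:
            continue
        cents, wts = weighted_kmeans(chunk, [1.0] * len(chunk), k, seed=seed)
        summary_pts.extend(cents)
        summary_wts.extend(wts)
    # Merge step: cluster the per-chunk centroids, weighted by cluster size.
    return weighted_kmeans(summary_pts, summary_wts, k, seed=seed)
```

In this sketch, peak memory is bounded by the size of one chunk plus k centroids per chunk seen so far, rather than by N. The price is that the merge step sees only the summaries, so the result is an approximation of what k-means over all points of the cell would produce.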

Keywords

data analysis; data reduction; statistical analysis; visual databases; data stream; k-means clustering; massive data sets; multivariate histogram; satellite imagery data; scientific data analysis; spatial data clustering; spatial data grid; Clustering algorithms; Computer science; Data analysis; Geoscience; Grid computing; Image coding; Kelvin; Laboratories; Propulsion; Satellites

fLanguage

English

Publisher

IEEE

Conference_Titel

Data Engineering, 2004. Proceedings. 20th International Conference on

ISSN

1063-6382

Print_ISBN

0-7695-2065-0

Type

conf

DOI

10.1109/ICDE.2004.1320061

Filename

1320061