O-Cluster: scalable clustering of large high dimensional data sets

Author

Milenova, Boriana L. ; Campos, Marcos M.

Author_Institution

Oracle Data Min. Technol., Burlington, MA, USA

fYear

2002

fDate

2002

Firstpage

290

Lastpage

297

Abstract

Clustering large data sets of high dimensionality has always been a challenge for clustering algorithms. Many recently developed clustering algorithms have attempted to handle data sets with a very large number of records, a very high number of dimensions, or both. We provide a discussion of the advantages and limitations of existing algorithms when they operate on very large multidimensional data sets. To simultaneously overcome both the "curse of dimensionality" and the scalability problems associated with large amounts of data, we propose a new clustering algorithm called O-Cluster. O-Cluster combines a novel active sampling technique with an axis-parallel partitioning strategy to identify continuous areas of high density in the input space. The method operates on a limited memory buffer and requires at most a single scan through the data. We demonstrate the high quality of the obtained clustering solutions, their robustness to noise, and O-Cluster's excellent scalability.
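The abstract does not say how the axis-parallel partitions are chosen, so the following is only a minimal sketch of the general idea and not the authors' procedure. It assumes a simple heuristic: build a histogram along one dimension and cut at the deepest low-density bin that has a denser bin on each side. The names `histogram`, `find_valley`, and `split_on_dimension` are hypothetical, and the sketch omits active sampling, the limited memory buffer, and any statistical test of the cut.

```python
from typing import List, Optional, Sequence, Tuple

Point = Sequence[float]


def histogram(values: List[float], lo: float, hi: float, bins: int) -> List[int]:
    """Count how many values fall into each of `bins` equal-width bins on [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # Clamp so that v == hi lands in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


def find_valley(counts: List[int]) -> Optional[int]:
    """Return the interior bin with the deepest dip relative to the peaks
    on both of its sides, or None if no bin dips below both sides."""
    best_idx, best_depth = None, 0
    for i in range(1, len(counts) - 1):
        left_peak = max(counts[:i])
        right_peak = max(counts[i + 1:])
        depth = min(left_peak, right_peak) - counts[i]
        if depth > best_depth:
            best_idx, best_depth = i, depth
    return best_idx


def split_on_dimension(
    points: List[Point], dim: int, lo: float, hi: float, bins: int
) -> Optional[Tuple[float, List[Point], List[Point]]]:
    """Assumed heuristic: cut perpendicular to axis `dim` at the center of
    the valley bin. Returns (cut, left, right), or None if no valley exists."""
    counts = histogram([p[dim] for p in points], lo, hi, bins)
    valley = find_valley(counts)
    if valley is None:
        return None
    width = (hi - lo) / bins
    cut = lo + (valley + 0.5) * width
    left = [p for p in points if p[dim] < cut]
    right = [p for p in points if p[dim] >= cut]
    return cut, left, right


if __name__ == "__main__":
    # Two dense groups along dimension 0, separated by an empty region.
    pts = [(0.05, 0.5), (0.1, 0.2), (0.15, 0.9), (0.12, 0.4),
           (0.85, 0.1), (0.9, 0.6), (0.95, 0.3), (0.88, 0.7)]
    result = split_on_dimension(pts, dim=0, lo=0.0, hi=1.0, bins=5)
    if result is not None:
        cut, left, right = result
        print(cut, len(left), len(right))
```

Applying such a split recursively to each resulting partition, one dimension at a time, would yield axis-parallel regions of high density. When the histogram has a single mode, `split_on_dimension` returns `None` and the partition is left whole.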

Keywords

computational complexity; data mining; pattern clustering; very large databases; O-Cluster; active sampling technique; axis-parallel partitioning strategy; complexity; data handling; large high dimensional data sets; limited memory buffer; multidimensional data sets; scalability; scalable clustering; Clustering algorithms; Computational complexity; Data mining; Information retrieval; Multidimensional systems; Multimedia databases; Partitioning algorithms; Sampling methods; Scalability; Shape

fLanguage

English

Publisher

IEEE

Conference_Titel

Data Mining, 2002. ICDM 2002. Proceedings. 2002 IEEE International Conference on

Print_ISBN

0-7695-1754-4

Type

conf

DOI

10.1109/ICDM.2002.1183915

Filename

1183915