Scalable model-based clustering by working on data summaries

Author

Jin, Huidong ; Wong, Man-Leung ; Leung, Kwong-Sak

Author_Institution

Dept. of Inf. Syst., Lingnan Univ., Tuen Mun, China

fYear

2003

fDate

19-22 Nov. 2003

Firstpage

91

Lastpage

98

Abstract

The scalability problem in data mining involves the development of methods for handling large databases with limited computational resources. We present a two-phase scalable model-based clustering framework: first, a large data set is summarized into subclusters; then, clusters are generated directly from the summary statistics of the subclusters by a specifically designed expectation-maximization (EM) algorithm. Taking Gaussian mixture models as an example, we establish a provably convergent EM algorithm, EMADS, which explicitly incorporates the cardinality, mean, and covariance information of each subcluster. Combined with different data summarization procedures, EMADS is used to construct two clustering systems: gEMADS and bEMADS. The experimental results demonstrate that they run several orders of magnitude faster than the classic EM algorithm with little loss of accuracy. They also generate significantly better results than other model-based clustering systems using similar computational resources.
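
The two-phase idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' EMADS algorithm, whose exact update rules are not given in this record. It makes several assumptions: the function names summarize_grid and em_on_summaries are hypothetical, a simple equal-width grid stands in for the data summarization procedure, and the E-step evaluates each Gaussian component only at the subcluster mean, while the M-step folds the subcluster covariance back into the component covariance.

```python
import numpy as np

def summarize_grid(X, bins=8):
    """Phase 1 (assumed grid scheme): cardinality, mean, covariance per non-empty cell."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    cells = np.minimum(((X - lo) / span * bins).astype(int), bins - 1)
    _, labels = np.unique(cells, axis=0, return_inverse=True)
    labels = labels.ravel()
    n, mu, S = [], [], []
    for c in range(labels.max() + 1):
        pts = X[labels == c]
        m = pts.mean(axis=0)
        diff = pts - m
        n.append(len(pts))
        mu.append(m)
        S.append(diff.T @ diff / len(pts))  # within-subcluster covariance
    return np.array(n, dtype=float), np.array(mu), np.array(S)

def _log_gauss(x, mean, cov):
    """Log density of N(mean, cov) at each row of x, via a Cholesky factor."""
    d = x.shape[1]
    L = np.linalg.cholesky(cov)
    z = np.linalg.solve(L, (x - mean).T)
    return -0.5 * (d * np.log(2 * np.pi) + (z ** 2).sum(axis=0)) - np.log(np.diag(L)).sum()

def em_on_summaries(n, mu, S, k, iters=50, reg=1e-6):
    """Phase 2 (simplified): EM for a Gaussian mixture using only (n, mu, S)."""
    d = mu.shape[1]
    N = n.sum()
    # Deterministic farthest-point initialisation over subcluster means.
    chosen = [int(np.argmax(n))]
    for _ in range(k - 1):
        dist = ((mu[:, None, :] - mu[chosen][None, :, :]) ** 2).sum(axis=2).min(axis=1)
        chosen.append(int(np.argmax(dist)))
    means = mu[chosen].copy()
    # Start every component from the overall covariance implied by the summaries.
    dg = mu - (n[:, None] * mu).sum(axis=0) / N
    total = (np.einsum("m,mij->ij", n, S) + (n[:, None] * dg).T @ dg) / N
    covs = np.tile(total + reg * np.eye(d), (k, 1, 1))
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibilities per subcluster, evaluated at its mean.
        logp = np.stack(
            [np.log(weights[j]) + _log_gauss(mu, means[j], covs[j]) for j in range(k)],
            axis=1,
        )
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: each subcluster counts n[m] times.
        wr = n[:, None] * r
        nk = np.maximum(wr.sum(axis=0), 1e-12)
        weights = nk / N
        means = (wr.T @ mu) / nk[:, None]
        for j in range(k):
            diff = mu - means[j]
            scatter = np.einsum("m,mij->ij", wr[:, j], S) + (wr[:, j, None] * diff).T @ diff
            covs[j] = scatter / nk[j] + reg * np.eye(d)
    return weights, means, covs
```

Calling summarize_grid on the raw data once and then em_on_summaries on its output means each EM iteration touches only the subcluster statistics, never the individual records, which is the source of the speed-up the abstract describes.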

Keywords

Gaussian processes; computational complexity; covariance analysis; data mining; pattern clustering; very large databases; EM; Gaussian mixture model; bEMADS; clustering system; covariance information; data mining; data summarization procedures; expectation-maximization algorithm; gEMADS; large databases; two-phase scalable model-based clustering; Algorithm design and analysis; Bridges; Clustering algorithms; Data mining; Databases; Explosives; Information systems; Iterative algorithms; Scalability; Statistics;

fLanguage

English

Publisher

IEEE

Conference_Titel

Data Mining, 2003. ICDM 2003. Third IEEE International Conference on

Print_ISBN

0-7695-1978-4

Type

conf

DOI

10.1109/ICDM.2003.1250907

Filename

1250907