DocumentCode
2369171
Title
Scalable model-based clustering by working on data summaries
Author
Jin, Huidong ; Wong, Man-Leung ; Leung, Kwong-Sak
Author_Institution
Dept. of Inf. Syst., Lingnan Univ., Tuen Mun, China
fYear
2003
fDate
19-22 Nov. 2003
Firstpage
91
Lastpage
98
Abstract
The scalability problem in data mining involves the development of methods for handling large databases with limited computational resources. We present a two-phase scalable model-based clustering framework: first, a large data set is summarized into subclusters; then, clusters are generated directly from the summary statistics of the subclusters by a specifically designed expectation-maximization (EM) algorithm. Taking Gaussian mixture models as an example, we establish a provably convergent EM algorithm, EMADS, which explicitly incorporates the cardinality, mean, and covariance information of each subcluster. Combined with different data summarization procedures, EMADS is used to construct two clustering systems: gEMADS and bEMADS. The experimental results demonstrate that they run several orders of magnitude faster than the classic EM algorithm with little loss of accuracy. They also generate significantly better results than other model-based clustering systems using similar computational resources.
Keywords
Gaussian processes; computational complexity; covariance analysis; data mining; pattern clustering; very large databases; EM; Gaussian mixture model; bEMADS; clustering system; covariance information; data mining; data summarization procedures; expectation-maximization algorithm; gEMADS; large databases; two-phase scalable model-based clustering; Algorithm design and analysis; Bridges; Clustering algorithms; Data mining; Databases; Explosives; Information systems; Iterative algorithms; Scalability; Statistics;
fLanguage
English
Publisher
IEEE
Conference_Titel
Data Mining, 2003. ICDM 2003. Third IEEE International Conference on
Print_ISBN
0-7695-1978-4
Type
conf
DOI
10.1109/ICDM.2003.1250907
Filename
1250907