DocumentCode
1196983
Title
P-AutoClass: scalable parallel clustering for mining large data sets
Author
Pizzuti, Clara ; Talia, Domenico
Author_Institution
Inst. of High Performance Comput. & Networking, Italian Nat. Res. Council, Rende, Italy
Volume
15
Issue
3
fYear
2003
Firstpage
629
Lastpage
641
Abstract
Data clustering is an important task in the area of data mining. Clustering is the unsupervised classification of data items into homogeneous groups called clusters. Clustering methods partition a set of data items into clusters such that items in the same cluster are more similar to each other than to items in different clusters, according to some defined criteria. Clustering algorithms are computationally intensive, particularly when they are used to analyze large amounts of data. One approach to reducing the processing time is to implement clustering algorithms on scalable parallel computers. This paper describes the design and implementation of P-AutoClass, a parallel version of the AutoClass system based upon the Bayesian model for determining optimal classes in large data sets. The P-AutoClass implementation divides the clustering task among the processors of a multicomputer so that each processor works on its own partition and exchanges intermediate results with the other processors. The system architecture, its implementation, and experimental performance results for different numbers of processors and different data sets are presented and compared with theoretical performance. In particular, the experimental and predicted scalability and efficiency of P-AutoClass versus the sequential AutoClass system are evaluated and compared.
Keywords
Bayes methods; data analysis; data mining; multiprocessing systems; pattern clustering; software performance evaluation; unsupervised learning; very large databases; Bayesian model; P-AutoClass; data analysis; data clustering; experimental performance results; large data set mining; large data sets; multicomputer; scalable parallel clustering; scalable parallel computers; unsupervised classification; Algorithm design and analysis; Bayesian methods; Clustering algorithms; Clustering methods; Computer Society; Concurrent computing; Data mining; Parallel processing; Partitioning algorithms; Scalability
fLanguage
English
Journal_Title
Knowledge and Data Engineering, IEEE Transactions on
Publisher
IEEE
ISSN
1041-4347
Type
jour
DOI
10.1109/TKDE.2003.1198395
Filename
1198395