DocumentCode
745205
Title
Effect of data skewness and workload balance in parallel data mining
Author
Cheung, David W. ; Lee, Sau D. ; Xiao, Yongqiao
Author_Institution
Dept. of Comput. Sci. & Inf. Syst., Hong Kong Univ., China
Volume
14
Issue
3
fYear
2002
Firstpage
498
Lastpage
514
Abstract
To mine association rules efficiently, we have developed a new parallel mining algorithm, FPM, for a distributed share-nothing parallel system in which data are partitioned across the processors. FPM is an enhancement of the FDM algorithm, which we previously proposed for distributed mining of association rules (Cheung et al., 1996). FPM requires fewer rounds of message exchange than FDM and hence has a better response time in a parallel environment. The algorithm has been found experimentally to outperform CD, a representative parallel algorithm for the same goal (Agrawal and Srikant, 1994). The efficiency of FPM is attributed to the incorporation of two powerful candidate-set pruning techniques: distributed pruning and global pruning. Both techniques are sensitive to two data distribution characteristics: data skewness and workload balance. Metrics based on entropy are proposed for these two characteristics. The prunings are very effective when both the skewness and the balance are high. To increase the efficiency of FPM, we have developed methods to partition a database so that the resulting partitions have high balance and skewness. Experiments have shown that our partitioning algorithms achieve these aims very well; in particular, the results are consistently better than those of a random partitioning. Moreover, the partitioning algorithms incur little overhead. Thus, using our partitioning algorithms together with FPM, we can mine association rules from a database efficiently.
Keywords
data mining; parallel algorithms; parallel databases; resource allocation; software metrics; very large databases; FDM algorithm; association rule mining; candidate sets pruning techniques; data skewness; distributed pruning; distributed share-nothing parallel system; experiments; global pruning; large databases; message exchange; parallel algorithm; parallel data mining; random partitioning; response time; workload balance
fLanguage
English
Journal_Title
Knowledge and Data Engineering, IEEE Transactions on
Publisher
IEEE
ISSN
1041-4347
Type
jour
DOI
10.1109/TKDE.2002.1000339
Filename
1000339