DocumentCode :
1159842
Title :
Parallel and distributed methods for incremental frequent itemset mining
Author :
Otey, Matthew Eric ; Parthasarathy, Srinivasan ; Wang, Chao ; Veloso, Adriano ; Meira, Wagner, Jr.
Author_Institution :
Comput. & Inf. Sci. Dept., Ohio State Univ., Columbus, OH, USA
Volume :
34
Issue :
6
Year :
2004
Firstpage :
2439
Lastpage :
2450
Abstract :
Traditional methods for data mining typically make the assumption that the data is centralized, memory-resident, and static. This assumption is no longer tenable. Such methods waste computational and input/output (I/O) resources when data is dynamic, and they impose excessive communication overhead when data is distributed. Efficient implementation of incremental data mining methods is, thus, becoming crucial for ensuring system scalability and facilitating knowledge discovery when data is dynamic and distributed. In this paper, we address this issue in the context of the important task of frequent itemset mining. We first present an efficient algorithm which dynamically maintains the required information even in the presence of data updates without examining the entire dataset. We then show how to parallelize this incremental algorithm. We also propose a distributed asynchronous algorithm, which imposes minimal communication overhead for mining distributed dynamic datasets. Our distributed approach is capable of generating local models (in which each site has a summary of its own database) as well as the global model of frequent itemsets (in which all sites have a summary of the entire database). This ability permits our approach not only to generate frequent itemsets, but also to generate high-contrast frequent itemsets, which allows one to examine how the data is skewed over different sites.
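The abstract does not give the algorithms themselves, so the following is only a minimal illustrative sketch of the ideas it names, not the authors' method. It assumes each site keeps support counts for every itemset up to a small fixed size, so that a block of new transactions can be folded in without rescanning older data, and that a global model can be formed by summing per-site counts. The names `SiteModel`, `merge`, `high_contrast`, `max_size`, and `min_gap` are hypothetical, and the definition of "high contrast" used here (a large gap between per-site supports) is an assumption.

```python
from collections import Counter
from itertools import combinations


class SiteModel:
    """Hypothetical per-site summary: counts for all itemsets up to max_size items."""

    def __init__(self, max_size=2):
        self.max_size = max_size
        self.counts = Counter()
        self.num_transactions = 0

    def add_block(self, transactions):
        """Fold a block of new transactions into the summary; old data is not rescanned."""
        for transaction in transactions:
            items = sorted(set(transaction))
            for k in range(1, self.max_size + 1):
                self.counts.update(combinations(items, k))
            self.num_transactions += 1

    def frequent(self, min_support):
        """Return {itemset: relative support} for itemsets meeting min_support."""
        n = self.num_transactions
        if n == 0:
            return {}
        return {s: c / n for s, c in self.counts.items() if c / n >= min_support}


def merge(sites):
    """Build a global summary by summing per-site counts (assumes equal max_size)."""
    merged = SiteModel(max_size=sites[0].max_size)
    for site in sites:
        merged.counts.update(site.counts)
        merged.num_transactions += site.num_transactions
    return merged


def high_contrast(sites, min_support, min_gap):
    """Globally frequent itemsets whose per-site supports differ by at least min_gap.

    Assumes every site has seen at least one transaction.
    """
    result = {}
    for itemset in merge(sites).frequent(min_support):
        supports = [s.counts[itemset] / s.num_transactions for s in sites]
        if max(supports) - min(supports) >= min_gap:
            result[itemset] = supports
    return result
```

In this sketch each site would call `add_block` as updates arrive, `frequent` gives the local model, and `merge` gives the global one. Keeping counts for all small itemsets is a simplification that trades memory for the ability to update without a rescan; a practical system would need a more compact summary.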
Keywords :
data mining; distributed algorithms; distributed databases; grid computing; parallel processing; distributed asynchronous algorithm; distributed dynamic datasets; distributed method; incremental frequent itemset mining; parallel method; Chaotic communication; Concurrent computing; Context; Distributed computing; Heuristic algorithms; Itemsets; Scalability; incremental data mining; parallel computing; Algorithms; Artificial Intelligence; Computer Communication Networks; Computing Methodologies; Database Management Systems; Databases, Factual; Information Dissemination; Information Storage and Retrieval; Pattern Recognition, Automated
Language :
English
Journal_Title :
Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on
Publisher :
IEEE
ISSN :
1083-4419
Type :
jour
DOI :
10.1109/TSMCB.2004.836887
Filename :
1356035