DocumentCode :
244546
Title :
A parallel algorithm for approximate frequent itemset mining using MapReduce
Author :
Fumarola, Fabio ; Malerba, Donato
Author_Institution :
Dept. of Comput. Sci., Univ. of Bari “Aldo Moro”, Bari, Italy
fYear :
2014
fDate :
21-25 July 2014
Firstpage :
335
Lastpage :
342
Abstract :
Recently, several algorithms based on the MapReduce framework have been proposed for frequent pattern mining in Big Data. However, these solutions come with their own technical challenges, such as inter-communication costs, in-process synchronization, balanced data distribution and input parameter tuning, all of which increase computation time. In this paper we present MrAdam, a novel parallel, distributed algorithm that addresses these problems. The key principle underlying the design of MrAdam is that one can make reasonable decisions in the absence of perfect answers. Given the classical minimum support threshold and a user-specified error bound, MrAdam exploits the Chernoff bound to mine “approximate” frequent itemsets with statistical error guarantees on their actual supports. These itemsets are generated in parallel and independently from subsets of the input dataset, using the MapReduce parallel computation framework. The resulting collections of frequent itemsets from each subset are then aggregated and filtered with a novel technique to produce a single output collection. Experiments on real datasets show that MrAdam scales well to gigabytes of data and tens of machines, and that it returns a good, statistically bounded approximation of the exact results.
Keywords :
Big Data; approximation theory; data mining; parallel algorithms; statistical analysis; Big Data; Chernoff bound; MapReduce framework; MapReduce parallel computation framework; MrAdam; approximate frequent itemset mining; balanced data distribution; distributed algorithm; frequent pattern mining; in-process synchronizations; input parameters tuning; intercommunication costs; parallel algorithm; statistical error; statistically bounded approximation; Approximation algorithms; Approximation methods; Data mining; Itemsets; Mathematical model; Reliability; Chernoff Bound; Frequent Itemset Mining; Map-Reduce;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
High Performance Computing & Simulation (HPCS), 2014 International Conference on
Conference_Location :
Bologna
Print_ISBN :
978-1-4799-5312-7
Type :
conf
DOI :
10.1109/HPCSim.2014.6903705
Filename :
6903705