DocumentCode :
167640
Title :
YAFIM: A Parallel Frequent Itemset Mining Algorithm with Spark
Author :
Hongjian Qiu ; Rong Gu ; Chunfeng Yuan ; Yihua Huang
Author_Institution :
Dept. of Comput. Sci. & Technol., Nanjing Univ., Nanjing, China
fYear :
2014
fDate :
19-23 May 2014
Firstpage :
1664
Lastpage :
1671
Abstract :
Frequent itemset mining (FIM) is one of the most important techniques for extracting knowledge from data in many real-world applications. The Apriori algorithm is a widely used algorithm for mining frequent itemsets from a transactional dataset. However, the FIM process is both data-intensive and computing-intensive. On one hand, large-scale datasets are now common in data mining; on the other hand, to generate valid information, the algorithm needs to scan the dataset iteratively many times. Together, these make FIM algorithms very time-consuming over big data. Parallel and distributed computing is an effective and widely used strategy for speeding up algorithms on large-scale datasets. However, the existing parallel Apriori algorithms implemented with the MapReduce model are not efficient enough for iterative computation. In this paper, we propose YAFIM (Yet Another Frequent Itemset Mining), a parallel Apriori algorithm based on the Spark RDD framework, an in-memory parallel computing model specially designed to support iterative algorithms and interactive data mining. Experimental results show that, compared with the algorithms implemented with MapReduce, YAFIM achieves an 18× speedup on average across various benchmarks. In addition, we apply YAFIM in a real-world medical application to explore the relationships in medicine, where it outperforms the MapReduce method by around 25 times.
Keywords :
data mining; iterative methods; parallel algorithms; FIM process; MapReduce; Spark RDD framework; YAFIM; computing-intensive; data-intensive; distributed computing; in-memory parallel computing model; interactive data mining; iterative algorithms; knowledge extraction; large scale data sets; large scale dataset algorithms; parallel apriori algorithm; parallel frequent itemset mining algorithm; real-world applications; transactional dataset; yet another frequent itemset mining; Algorithm design and analysis; Classification algorithms; Clustering algorithms; Computational modeling; Data mining; Itemsets; Sparks; Apriori Algorithm; Frequent Itemset Mining; Medical Application; Parallel Computing; Spark;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel & Distributed Processing Symposium Workshops (IPDPSW), 2014 IEEE International
Conference_Location :
Phoenix, AZ
Print_ISBN :
978-1-4799-4117-9
Type :
conf
DOI :
10.1109/IPDPSW.2014.185
Filename :
6969575