DocumentCode
2028055
Title
A Scalable Distributed Framework for Efficient Analytics on Ordered Datasets
Author
Jiangtao Yin; Yong Liao; Mario Baldi; Lixin Gao; Antonio Nucci
fYear
2013
fDate
9-12 Dec. 2013
Firstpage
131
Lastpage
138
Abstract
Event log files are among the most common datasets that corporations use to gain business intelligence. The records in event log files are often temporally ordered and need to be grouped by user ID, with the temporal ordering preserved, to facilitate mining user behaviors. This kind of analytical workload, here referred to as Relative Order-preserving based Grouping (RE-ORG), is quite common in big data analytics. Using MapReduce/Hadoop to execute RE-ORG tasks on ordered datasets is inefficient because of its internal sort-merge mechanism. In this paper, we propose a distributed framework that adopts an efficient group-order-merge mechanism to execute RE-ORG tasks faster. We demonstrate the advantage of our framework by comparing its performance with Hadoop through extensive experiments on real-world datasets. The evaluation results show that our framework can achieve a speedup of up to 6.3× over Hadoop in executing RE-ORG tasks.
Keywords
Big Data; competitive intelligence; data analysis; distributed databases; real-time systems; MapReduce/Hadoop; RE-ORG; big data analytics; business intelligence; event log files; group-order-merge mechanism; ordered dataset analytics; real-world datasets; relative order-preserving based grouping; scalable distributed framework; temporal ordering; user behavior mining; Business; Indexes; Instruments; Merging; Open source software; Programming; Sorting; Hadoop; MapReduce; distributed framework; ordered dataset
fLanguage
English
Publisher
IEEE
Conference_Titel
Utility and Cloud Computing (UCC), 2013 IEEE/ACM 6th International Conference on
Conference_Location
Dresden
Type
conf
DOI
10.1109/UCC.2013.35
Filename
6809349