Title :
A Scalable Distributed Framework for Efficient Analytics on Ordered Datasets
Author :
Jiangtao Yin ; Yong Liao ; Mario Baldi ; Lixin Gao ; Antonio Nucci
Abstract :
Event log files are among the datasets that corporations most commonly use to gain business intelligence. The records in these files are often temporally ordered, and they need to be grouped by user ID with the temporal ordering preserved to facilitate mining user behaviors. This kind of analytical workload, here referred to as Relative Order-preserving based Grouping (RE-ORG), is quite common in big data analytics. Executing RE-ORG tasks on ordered datasets with MapReduce/Hadoop is inefficient because of Hadoop's internal sort-merge mechanism. In this paper, we propose a distributed framework that adopts an efficient group-order-merge mechanism to execute RE-ORG tasks faster. We demonstrate the advantage of our framework by comparing its performance with Hadoop's through extensive experiments on real-world datasets. The evaluation results show that our framework achieves up to a 6.3× speedup over Hadoop in executing RE-ORG tasks.
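To make the RE-ORG workload concrete, the following is a minimal Python sketch of its semantics only. It is not the framework proposed in the paper, whose internals the abstract does not describe. The record layout `(timestamp, user_id, event)`, the function names `group_preserving_order` and `merge_groups`, and the sample data are all assumptions made for illustration. The sketch groups each already-ordered partition in a single pass, then combines partitions with a k-way merge, so the records are never fully re-sorted.

```python
from collections import defaultdict
from heapq import merge


def group_preserving_order(records):
    """Group (timestamp, user_id, event) records by user_id in one pass.

    Appending in input order keeps each user's records in their original
    relative order, so temporally ordered input needs no sort.
    """
    groups = defaultdict(list)
    for timestamp, user_id, event in records:
        groups[user_id].append((timestamp, event))
    return dict(groups)


def merge_groups(partials):
    """Combine per-partition groups (hypothetical helper).

    Each partition's per-user list is already time-ordered, so a k-way
    merge on the timestamp is enough to restore a global order per user.
    """
    users = set()
    for partial in partials:
        users.update(partial)
    return {
        user: list(merge(*(p.get(user, []) for p in partials),
                         key=lambda record: record[0]))
        for user in users
    }


# Two illustrative partitions, each already ordered by timestamp.
part_a = [(1, "u1", "login"), (2, "u2", "login"), (5, "u1", "click")]
part_b = [(3, "u1", "view"), (4, "u2", "logout")]

merged = merge_groups([group_preserving_order(part_a),
                       group_preserving_order(part_b)])
```

In this sketch the per-user lists come out in timestamp order without any sort over the whole dataset, which is the property that a sort-merge shuffle does not exploit when the input is already ordered.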
Keywords :
Big Data; competitive intelligence; data analysis; distributed databases; real-time systems; MapReduce/Hadoop; RE-ORG; big data analytics; business intelligence; event log files; group-order-merge mechanism; ordered dataset analytics; real-world datasets; relative order-preserving based grouping; scalable distributed framework; temporal ordering; user behavior mining; Business; Indexes; Instruments; Merging; Open source software; Programming; Sorting; Hadoop; MapReduce; distributed framework; ordered dataset
Conference_Title :
Utility and Cloud Computing (UCC), 2013 IEEE/ACM 6th International Conference on
Conference_Location :
Dresden
DOI :
10.1109/UCC.2013.35