DocumentCode :
1665420
Title :
Efficient and Self-Balanced ROLLUP Aggregates for Large-Scale Data Summarization
Author :
Duy-Hung Phan ; Quang-Nhat Hoang-Xuan ; Dell'Amico, Matteo ; Michiardi, Pietro
Author_Institution :
EURECOM, France
fYear :
2015
Firstpage :
158
Lastpage :
165
Abstract :
Data summarization queries that compute aggregates by grouping datasets across several dimensions are essential to help users make sense of very large datasets. In this work, we focus on ROLLUP, an important operator that has been recently added to the Hadoop MapReduce ecosystem. However, its current implementation suffers from very large communication costs, leading to inefficient executions. We thus proceed with the design of a new ROLLUP operator for high-level languages. Our operator is self-optimizing, which means that it automatically performs load-balancing and determines a suitable operating point to achieve the highest performance. We have implemented our ROLLUP operator for Apache Pig, a popular high-level language in the Hadoop ecosystem. Our experimental results, obtained on both synthetic and real datasets, indicate that our new operator outperforms the current ROLLUP implementation in Pig by at least 50%.
Keywords :
data handling; parallel processing; resource allocation; Apache Pig; Hadoop MapReduce ecosystem; ROLLUP operator; communication cost; data summarization queries; high-level language; large-scale data summarization; load balancing; self-balanced ROLLUP aggregates; self-optimizing operator; Aggregates; Algorithm design and analysis; Clustering algorithms; Load modeling; Partitioning algorithms; Runtime; Tuning; MapReduce; ROLLUP; data summarization; optimization;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Big Data (BigData Congress), 2015 IEEE International Congress on
Conference_Location :
New York, NY
Print_ISBN :
978-1-4673-7277-0
Type :
conf
DOI :
10.1109/BigDataCongress.2015.31
Filename :
7207215