DocumentCode :
3434985
Title :
Block Sampling: Efficient Accurate Online Aggregation in MapReduce
Author :
Kalavri, Vasiliki ; Brundza, Vaidas ; Vlassov, Vladimir
Author_Institution :
KTH R. Inst. of Technol., Stockholm, Sweden
Volume :
1
fYear :
2013
fDate :
2-5 Dec. 2013
Firstpage :
250
Lastpage :
257
Abstract :
Large-scale data processing frameworks, such as Hadoop MapReduce, are widely used to analyze enormous amounts of data. However, processing is often time-consuming, preventing interactive analysis. One way to decrease response time is partial job execution, where an approximate, early result becomes available to the user, prior to job completion. The Hadoop Online Prototype (HOP) uses online aggregation to provide early results, by partially executing jobs on subsets of the input, using a simplistic progress metric. Due to its sequential nature, values are not objectively represented in the input subset, often resulting in poor approximations or "data bias". In this paper, we propose a block sampling technique for large-scale data processing, which can be used for fast and accurate partial job execution. Our implementation of the technique on top of HOP uniformly samples HDFS blocks and uses in-memory shuffling to reduce data bias. Our prototype significantly improves the accuracy of HOP's early results, while only introducing minimal overhead. We evaluate our technique using real-world datasets and applications and demonstrate that our system outperforms HOP in terms of accuracy. In particular, when estimating the average temperature of the studied dataset, our system provides high accuracy (less than 20% absolute error) after processing only 10% of the input, while HOP needs to process 70% of the input to yield comparable results.
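The contrast the abstract draws between sequential partial execution and uniform block sampling can be illustrated with a small sketch. This is not the authors' implementation and uses no HDFS or HOP API: the blocks are plain Python lists, the data is synthetic, and every name (`split_into_blocks`, `sequential_prefix`, `sampled_blocks`) is hypothetical. It assumes records are stored in an order correlated with their value, which is the situation in which reading a prefix of the input gives a biased early estimate.

```python
import random


def split_into_blocks(records, block_size):
    """Cut the record list into fixed-size blocks (a stand-in for storage blocks)."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def sequential_prefix(blocks, fraction):
    """Return the records of the first `fraction` of blocks, in stored order."""
    k = max(1, round(len(blocks) * fraction))
    return [r for block in blocks[:k] for r in block]


def sampled_blocks(blocks, fraction, rng):
    """Pick `fraction` of the blocks uniformly at random, then shuffle their records.

    The shuffle does not change the mean of the full sample; it only removes
    ordering effects if the sample is consumed incrementally.
    """
    k = max(1, round(len(blocks) * fraction))
    chosen = rng.sample(blocks, k)  # uniform, without replacement
    records = [r for block in chosen for r in block]
    rng.shuffle(records)  # in-memory shuffle of the sampled records
    return records


def mean(values):
    return sum(values) / len(values)


# Synthetic, hypothetical data: 1000 readings stored in increasing order.
records = [float(i) for i in range(1000)]
blocks = split_into_blocks(records, 50)  # 20 blocks of 50 records

rng = random.Random(0)  # fixed seed so the run is repeatable
true_mean = mean(records)
sequential_mean = mean(sequential_prefix(blocks, 0.1))
sample = sampled_blocks(blocks, 0.1, rng)
sampled_mean = mean(sample)

print("true mean:           ", true_mean)
print("sequential 10% mean: ", sequential_mean)
print("block-sampled 10%:   ", sampled_mean)
```

On this sorted toy input the sequential prefix always covers only the smallest values, so its estimate is as far from the true mean as any two-block subset can be, while a uniform choice of blocks can land anywhere in the data. The sketch says nothing about the error figures reported in the abstract, which come from the authors' own datasets.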
Keywords :
data analysis; parallel programming; HDFS block sampling technique; HOP accuracy improvement; Hadoop MapReduce; Hadoop Online Prototype; absolute error; average temperature estimation; data analysis; data bias reduction; in-memory shuffling; input subset; job completion; large-scale data processing frameworks; minimal overhead; online aggregation; partial-job execution; progress metrics; response time; Accuracy; Approximation methods; Data processing; Estimation; Meteorology; Prototypes; Temperature measurement; MapReduce; approximate results; online aggregation; sampling;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Cloud Computing Technology and Science (CloudCom), 2013 IEEE 5th International Conference on
Conference_Location :
Bristol
Type :
conf
DOI :
10.1109/CloudCom.2013.40
Filename :
6753805