Author :
Kalavri, Vasiliki ; Brundza, Vaidas ; Vlassov, Vladimir
Abstract :
Large-scale data processing frameworks, such as Hadoop MapReduce, are widely used to analyze enormous amounts of data. However, processing is often time-consuming, preventing interactive analysis. One way to decrease response time is partial job execution, where an approximate, early result becomes available to the user prior to job completion. The Hadoop Online Prototype (HOP) uses online aggregation to provide early results by partially executing jobs on subsets of the input, using a simplistic progress metric. Due to its sequential nature, values are not objectively represented in the input subset, often resulting in poor approximations or "data bias". In this paper, we propose a block sampling technique for large-scale data processing, which can be used for fast and accurate partial job execution. Our implementation of the technique on top of HOP uniformly samples HDFS blocks and uses in-memory shuffling to reduce data bias. Our prototype significantly improves the accuracy of HOP's early results while introducing only minimal overhead. We evaluate our technique using real-world datasets and applications and demonstrate that our system outperforms HOP in terms of accuracy. In particular, when estimating the average temperature of the studied dataset, our system provides high accuracy (less than 20% absolute error) after processing only 10% of the input, while HOP needs to process 70% of the input to yield comparable results.
Keywords :
data analysis; parallel programming; HDFS block sampling technique; HOP accuracy improvement; Hadoop MapReduce; Hadoop Online Prototype; absolute error; average temperature estimation; data bias reduction; in-memory shuffling; input subset; job completion; large-scale data processing frameworks; minimal overhead; online aggregation; partial-job execution; progress metrics; response time; Accuracy; Approximation methods; Data processing; Estimation; Meteorology; Prototypes; Temperature measurement; MapReduce; approximate results; sampling