Title :
Dataset Scaling and MapReduce Performance
Author :
Fan Zhang; Majd Sakr
Author_Institution :
Dept. of Comput. Sci., Carnegie Mellon Univ. in Qatar, Doha, Qatar
Abstract :
Predicting execution behavior of MapReduce applications when scaling the input dataset presents a challenging problem. The difficulty lies in the distributed locations of input data and the distributed, virtualized compute resources that utilize different network substrates. The potential payoff lies in using small datasets and limited test runs to understand how applications will behave with "big data." Our research has developed an in-depth understanding of MapReduce application performance and analyzed the impact of scaling input datasets. While we might expect that "embarrassingly parallel" MapReduce jobs should scale linearly with input dataset size, our results show that execution time sometimes increases nonlinearly. To verify our predictions, we identify a benchmark set of Map-, Shuffle-, and Reduce-intensive applications. Experimental results show that our execution-time analysis distinguishes four typical application behaviors when scaling input datasets.
Keywords :
benchmark testing; parallel processing; software performance evaluation; virtualisation; MapReduce application execution behavior prediction; MapReduce application performance; MapReduce jobs; dataset scaling; distributed data locations; execution-time analysis; map-intensive applications; reduce-intensive applications; shuffle-intensive applications; virtualized compute resources; Analytical models; Benchmark testing; Computational modeling; Mathematical model; Parallel processing; Scalability; TV; Cloud computing; MapReduce applications; dataset size; input scaling; parallel computing
Conference_Title :
Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International
Conference_Location :
Cambridge, MA
Print_ISBN :
978-0-7695-4979-8
DOI :
10.1109/IPDPSW.2013.143