• DocumentCode
    3588659
  • Title

    MRTune: A simulator for performance tuning of MapReduce jobs with skewed data

  • Author

    Xibo Zhou ; Wuman Luo ; Haoyu Tan

  • Author_Institution
    Guangzhou HKUST Fok Ying Tung Res. Inst., Hong Kong Univ. of Sci. & Technol., Hong Kong, China
  • fYear
    2014
  • Firstpage
    352
  • Lastpage
    359
  • Abstract
    MapReduce is a programming model designed by Google that has been widely used for both high performance computing and big data processing. Although the programming model is simple, performance tuning for a MapReduce job is very challenging, given the complexity of the configuration parameters and the various tradeoffs between the performance gain of the optimization approaches and the extra overhead they bring about. One naive way to address this issue is to run the MapReduce job repeatedly using different combinations of configuration parameters and optimization methods, then select the combination with the shortest running time. However, real execution is impractical because there may be too many combinations and a single run of each combination may take too long. Therefore, it is desirable to efficiently estimate the runtime of a job without real execution, using only the input data and the configuration parameter settings of the cluster. In this paper, we propose a novel MapReduce simulator called MRTune for runtime estimation of MapReduce jobs. MRTune takes the key distribution of the input data into consideration and works well even when the key distribution is skewed. Moreover, MRTune can estimate the runtime of a job in the presence of unpredictable task failures. We evaluate MRTune by implementing MapReduce jobs with Zipfian-distributed input data. The results show that MRTune can estimate the runtime of MapReduce jobs with high accuracy and efficiency even when the key distribution of the input data is skewed. We also conduct two case studies to analyse the impact of data skew and task failures on a MapReduce job.
  • Keywords
    data handling; parallel programming; software performance evaluation; Google; MRTune; MRTune implementation; MapReduce jobs; Reduce simulator; Zipfian distributed input data; configuration parameter complexities; job runtime estimation; optimization approach; overhead; performance gain; performance tuning; programming model; skewed data; unpredictable task failures; Complexity theory; Computational modeling; Data models; Tuning; MapReduce; runtime estimation; simulator; skew
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Systems (ICPADS), 2014 20th IEEE International Conference on
  • Type

    conf

  • DOI
    10.1109/PADSW.2014.7097828
  • Filename
    7097828