Title :
A self-tuning system based on application Profiling and Performance Analysis for optimizing Hadoop MapReduce cluster configuration
Author :
Dili Wu ; Gokhale, Aniruddha
Author_Institution :
Dept. of EECS, Vanderbilt Univ., Nashville, TN, USA
Abstract :
One of the most widely used frameworks for programming MapReduce-based applications is Apache Hadoop. Despite its popularity, application developers face numerous challenges in using the Hadoop framework: they must effectively manage the resources of a MapReduce cluster and configure the framework in a way that optimizes the performance and reliability of the MapReduce applications running on it. This paper addresses these problems by presenting the Profiling and Performance Analysis-based System (PPABS) framework, which automates the tuning of Hadoop configuration settings based on deduced application performance requirements. The PPABS framework comprises two distinct phases: the Analyzer, which trains PPABS to form a set of equivalence classes of MapReduce applications and determines, for each class, the Hadoop configuration parameters that maximally improve performance; and the Recognizer, which classifies an incoming unknown job into one of these equivalence classes so that its Hadoop configuration parameters can be self-tuned. The key research contributions in the Analyzer phase include modifications to the well-known k-means++ clustering and Simulated Annealing algorithms, which were required to adapt them to the MapReduce paradigm. The key contributions in the Recognizer phase include an approach to classify an unknown, incoming job into one of the equivalence classes and a control strategy to self-tune the Hadoop cluster configuration parameters for that job. Experimental results for three different classes of applications running on Hadoop clusters deployed on Amazon EC2 show promising performance improvements.
Keywords :
equivalence classes; parallel programming; pattern classification; pattern clustering; public domain software; simulated annealing; software performance evaluation; Amazon EC2; Analyzer phase; Apache Hadoop framework; Hadoop configuration parameters; Hadoop configuration settings; MapReduce-based application programming; PPABS framework; Recognizer phase; deduced application performance requirements; equivalence classes; job classification; k-means++ clustering; performance optimization; profiling and performance analysis-based system framework; resource management; simulated annealing algorithms; Kernel; Legged locomotion; Hadoop; MapReduce; optimization; self-tuning;
Conference_Title :
High Performance Computing (HiPC), 2013 20th International Conference on
Conference_Location :
Bangalore
DOI :
10.1109/HiPC.2013.6799133