DocumentCode :
627437
Title :
Getting more for less in optimized MapReduce workflows
Author :
Zhuoyao Zhang ; Ludmila Cherkasova ; Boon Thau Loo
fYear :
2013
fDate :
27-31 May 2013
Firstpage :
93
Lastpage :
100
Abstract :
Many companies are piloting the use of Hadoop for advanced data analytics over large datasets. Typically, such MapReduce programs represent workflows of MapReduce jobs. Currently, a user must specify the number of reduce tasks for each MapReduce job. The choice of the right number of reduce tasks is non-trivial and depends on the cluster size, input dataset of the job, and the amount of resources available for processing this job. In the workflow of MapReduce jobs, the output of one job becomes the input of the next job, and therefore the number of reduce tasks in the previous job may impact the performance and processing efficiency of the next job. In this work, we offer a novel performance evaluation framework for easing the user efforts of tuning the reduce task settings while achieving performance objectives. The proposed framework is based on two performance models: a platform performance model and a workflow performance model. A platform performance model characterizes the execution time of each generic phase in the MapReduce processing pipeline as a function of processed data. The complementary workflow performance model evaluates the completion time of a given workflow as a function of i) input dataset size(s) and ii) the reduce tasks' settings in the jobs that comprise a given workflow. We validate the accuracy, effectiveness, and performance benefits of the proposed framework using a set of realistic MapReduce applications and queries from the TPC-H benchmark.
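The two-model idea in the abstract can be illustrated with a toy sketch. This is not the authors' framework: the linear per-phase costs, every coefficient, the slot counts, the 64 MB split size, the output ratio, and all function names below are hypothetical and chosen only for illustration. The sketch assumes that each reduce task writes one output file and that the next job launches at least one map task per input file, which is how the reduce setting of one job can affect the job that follows it.

```python
import itertools
import math

# Hypothetical per-phase linear costs: seconds = intercept + slope * megabytes.
# These numbers are invented for illustration, not measured on any platform.
PHASE_COST = {
    "map": (1.0, 0.05),
    "shuffle": (0.5, 0.04),
    "reduce": (0.5, 0.03),
}


def phase_time(phase, megabytes):
    """Toy platform model: time of one task phase as a function of processed data."""
    intercept, slope = PHASE_COST[phase]
    return intercept + slope * megabytes


def job_time(input_mb, num_files, num_reduces, slots=10, split_mb=64.0):
    """Toy job model: tasks run in waves over a fixed number of slots (assumed)."""
    per_file_mb = input_mb / num_files
    # At least one map task per input file, more if a file exceeds the split size.
    num_maps = num_files * max(1, math.ceil(per_file_mb / split_mb))
    map_stage = math.ceil(num_maps / slots) * phase_time("map", input_mb / num_maps)

    # Assumption: intermediate data size equals the job's input size.
    per_reduce_mb = input_mb / num_reduces
    reduce_task = phase_time("shuffle", per_reduce_mb) + phase_time("reduce", per_reduce_mb)
    reduce_stage = math.ceil(num_reduces / slots) * reduce_task
    return map_stage + reduce_stage


def workflow_time(input_mb, reduce_settings, output_ratio=0.5):
    """Toy workflow model: sequential jobs; each job's output feeds the next job."""
    total, data_mb, num_files = 0.0, input_mb, 1
    for num_reduces in reduce_settings:
        total += job_time(data_mb, num_files, num_reduces)
        data_mb *= output_ratio      # assumed output-to-input size ratio
        num_files = num_reduces      # one output file per reduce task
    return total


def best_settings(input_mb, num_jobs, candidates):
    """Brute-force search over candidate reduce-task counts for every job."""
    return min(itertools.product(candidates, repeat=num_jobs),
               key=lambda settings: workflow_time(input_mb, settings))
```

For example, `workflow_time(640, [100, 10])` comes out larger than `workflow_time(640, [10, 10])` under these made-up costs, because the 100 small output files of the first job force 100 short map tasks in the second job. Real behavior depends on measured phase costs and the actual cluster configuration.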
Keywords :
data analysis; parallel programming; pipeline processing; software performance evaluation; task analysis; workflow management software; Hadoop; MapReduce job workflow; MapReduce processing pipeline; MapReduce programs; MapReduce queries; TPC-H benchmark; advanced data analytics; cluster size; complementary workflow performance model; generic phase; input dataset; input dataset size; job processing; optimized MapReduce workflows; performance evaluation framework; performance impact; platform performance model; processing efficiency; task reduction; workflow completion time evaluation; workflow performance model; Benchmark testing; Computational modeling; Data models; Phase measurement; Production; Time measurement; Tuning;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Integrated Network Management (IM 2013), 2013 IFIP/IEEE International Symposium on
Conference_Location :
Ghent
Print_ISBN :
978-1-4673-5229-1
Type :
conf
Filename :
6572974