DocumentCode
2989612
Title
Towards self-caring mapreduce: Proactively reducing fault-induced execution-time penalties
Author
Kadirvel, Selvi ; Fortes, José A B
Author_Institution
Adv. Comput. & Inf. Syst. Lab., Univ. of Florida, Gainesville, FL, USA
fYear
2011
fDate
4-8 July 2011
Firstpage
63
Lastpage
71
Abstract
Self-caring IT systems are those that can proactively avoid system failures rather than reactively handle failures after they have occurred. In this paper, we are interested in failures in which a MapReduce job is unable to execute within an SLA-based completion time. The existing fault-tolerance capability provided by MapReduce frameworks is simple, and the penalty associated with handling failures could potentially lead to excessive job execution times. Our goal in this paper is to bring out the severity of this penalty for different job characteristics and configurable framework parameters. We first quantitatively evaluate the penalty in execution time associated with node failures in the open-source MapReduce framework, Hadoop, using the MRPerf simulator. This increase in execution time is particularly expensive in pay-as-you-go cloud infrastructures, where users are charged by resource usage duration. Our solution minimizes job-completion-time SLA violations by augmenting the existing fault-tolerance capability of the MapReduce framework using a dynamic resource scaling approach. This resource scaling approach leverages the elastic properties of a cloud to mitigate execution-time penalties and hence proactively avoids a potential job failure. Using our proposed approach for various job and framework parameters, we show that performance penalties can be decreased by up to 78% in the case of single-node failures and by up to 100% in the case of 4-node failures at minimal additional cost.
Keywords
cloud computing; fault tolerant computing; public domain software; resource allocation; MRPerf simulator; SLA-based completion time; configurable framework parameter; dynamic resource scaling approach; elastic property; failure handling; fault tolerance capability; fault-induced execution-time penalty reduction; fault-tolerance capability; open-source MapReduce framework; pay-as-you-go cloud infrastructures; resource usage duration; self-caring IT system; self-caring MapReduce; Distributed databases; Dynamic scheduling; Fault tolerance; Fault tolerant systems; Google; Organizations; Runtime; MapReduce; autonomic computing; cloud computing; failure; scaling; self-caring
fLanguage
English
Publisher
IEEE
Conference_Titel
High Performance Computing and Simulation (HPCS), 2011 International Conference on
Conference_Location
Istanbul
Print_ISBN
978-1-61284-380-3
Type
conf
DOI
10.1109/HPCSim.2011.5999808
Filename
5999808