Towards self-caring MapReduce: Proactively reducing fault-induced execution-time penalties

Author

Kadirvel, Selvi ; Fortes, José A B

Author_Institution

Adv. Comput. & Inf. Syst. Lab., Univ. of Florida, Gainesville, FL, USA

fYear

2011

fDate

4-8 July 2011

Firstpage

63

Lastpage

71

Abstract

Self-caring IT systems are those that can proactively avoid system failures rather than reactively handle failures after they have occurred. In this paper, we are interested in failures in which a MapReduce job is unable to execute within an SLA-based completion time. The existing fault-tolerance capability provided by MapReduce frameworks is simple, and the penalty associated with handling failures could potentially lead to excessive job execution times. Our goal in this paper is to bring out the severity of this penalty for different job characteristics and configurable framework parameters. We first quantitatively evaluate the penalty in execution time associated with node failures in the open-source MapReduce framework, Hadoop, using the MRPerf simulator. This increase in execution time is particularly expensive in pay-as-you-go cloud infrastructures, where users are charged by resource usage duration. Our solution minimizes job-completion-time SLA violations by augmenting the existing fault-tolerance capability of the MapReduce framework using a dynamic resource scaling approach. This resource scaling approach leverages the elastic properties of a cloud to mitigate execution-time penalties and hence proactively avoid a potential job failure. Using our proposed approach for various job and framework parameters, we show that performance penalties can be decreased by up to 78% in the case of single-node failures and by up to 100% in the case of 4-node failures at minimal additional cost.

Keywords

cloud computing; fault tolerant computing; public domain software; resource allocation; MRPerf simulator; SLA-based completion time; configurable framework parameter; dynamic resource scaling approach; elastic property; failure handling; fault tolerance capability; fault-induced execution-time penalty reduction; fault-tolerance capability; open-source MapReduce framework; pay-as-you-go cloud infrastructures; resource usage duration; self-caring IT system; self-caring MapReduce; Distributed databases; Dynamic scheduling; Fault tolerance; Fault tolerant systems; Google; Organizations; Runtime; MapReduce; autonomic computing; failure; scaling; self-caring

fLanguage

English

Publisher

IEEE

Conference_Titel

High Performance Computing and Simulation (HPCS), 2011 International Conference on

Conference_Location

Istanbul

Print_ISBN

978-1-61284-380-3

Type

conf

DOI

10.1109/HPCSim.2011.5999808

Filename

5999808