Proactive Resource Management for Failure Resilient High Performance Computing Clusters

Author

Fu, Song ; Xu, Cheng-Zhong

Author_Institution

Dept. of Comput. Sci., New Mexico Inst. of Min. & Tech., Socorro, NM

fYear

2009

fDate

16-19 March 2009

Firstpage

257

Lastpage

264

Abstract

Virtual machine (VM) technology provides an additional layer of abstraction for resource management in high-performance computing (HPC) systems. In large-scale computing clusters, ever-increasing system complexity makes component failures the norm rather than the exception. VM construction and reconfiguration is a potent tool for efficient online system maintenance and failure resilience. In this paper, we study how VM-based HPC clusters benefit from failure prediction in resource management for dependable computing. We consider both the reliability and performance status of compute nodes in making selection decisions. We define a capacity-reliability metric to combine the effects of both factors, and propose the Best-fit algorithm to find the best qualified nodes on which to instantiate VMs to run user jobs. We have conducted experiments using failure traces from the Los Alamos National Laboratory (LANL) HPC clusters. The results show that our proposed strategy enhances system dependability with practically achievable accuracy of failure prediction. With the Best-fit strategies, the job completion rate is increased by 10.5% compared with that achieved in the current LANL HPC cluster. The task completion rate reaches 82.5% with improved utilization of relatively unreliable nodes.

Keywords

resource allocation; software maintenance; software reliability; system recovery; virtual machines; workstation clusters; HPC clusters; VM construction; VM reconfiguration; best-fit algorithm; capacity-reliability metric; component failure; efficient online system maintenance; failure prediction; failure resilience; failure resilient high performance computing clusters; high-performance computing; job completion rate; large-scale computing cluster; proactive resource management; system complexity; system dependability; task completion rate; virtual machine; Clustering algorithms; High performance computing; Laboratories; Large-scale systems; Maintenance; Resilience; Resource management; Virtual machining; Virtual manufacturing; Voice mail; Dependable systems; Failure resilience; High performance computing; Proactive management;

fLanguage

English

Publisher

IEEE

Conference_Titel

Availability, Reliability and Security, 2009. ARES '09. International Conference on

Conference_Location

Fukuoka

Print_ISBN

978-1-4244-3572-2

Electronic_ISBN

978-0-7695-3564-7

Type

conf

DOI

10.1109/ARES.2009.13

Filename

5066481