DocumentCode :
1922244
Title :
Proactive Resource Management for Failure Resilient High Performance Computing Clusters
Author :
Fu, Song ; Xu, Cheng-Zhong
Author_Institution :
Dept. of Comput. Sci., New Mexico Inst. of Min. & Tech., Socorro, NM
fYear :
2009
fDate :
16-19 March 2009
Firstpage :
257
Lastpage :
264
Abstract :
Virtual machine (VM) technology provides an additional layer of abstraction for resource management in high-performance computing (HPC) systems. In large-scale computing clusters, ever-increasing system complexity makes component failures the norm rather than the exception. VM construction and reconfiguration are potent tools for efficient online system maintenance and failure resilience. In this paper, we study how VM-based HPC clusters benefit from failure prediction in resource management for dependable computing. We consider both the reliability and the performance status of compute nodes when making node selection decisions. We define a capacity-reliability metric that combines the effects of both factors, and propose the Best-fit algorithm to find the best-qualified nodes on which to instantiate VMs to run user jobs. We have conducted experiments using failure traces from the Los Alamos National Laboratory (LANL) HPC clusters. The results show that our proposed strategy enhances system dependability with practically achievable failure-prediction accuracy. With the Best-fit strategies, the job completion rate increases by 10.5% compared with that achieved in the current LANL HPC cluster. The task completion rate reaches 82.5%, with improved utilization of relatively unreliable nodes.
Keywords :
resource allocation; software maintenance; software reliability; system recovery; virtual machines; workstation clusters; HPC clusters; VM construction; VM reconfiguration; best-fit algorithm; capacity-reliability metric; component failure; efficient online system maintenance; failure prediction; failure resilience; failure resilient high performance computing clusters; high-performance computing; job completion rate; large-scale computing cluster; proactive resource management; system complexity; system dependability; task completion rate; virtual machine; Clustering algorithms; High performance computing; Laboratories; Large-scale systems; Maintenance; Resilience; Resource management; Virtual machining; Virtual manufacturing; Voice mail; Dependable systems; Failure resilience; High performance computing; Proactive management;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Availability, Reliability and Security, 2009. ARES '09. International Conference on
Conference_Location :
Fukuoka
Print_ISBN :
978-1-4244-3572-2
Electronic_ISBN :
978-0-7695-3564-7
Type :
conf
DOI :
10.1109/ARES.2009.13
Filename :
5066481