• DocumentCode
    1922244
  • Title

    Proactive Resource Management for Failure Resilient High Performance Computing Clusters

  • Author

    Fu, Song ; Xu, Cheng-Zhong

  • Author_Institution
    Dept. of Comput. Sci., New Mexico Inst. of Min. & Tech., Socorro, NM
  • fYear
    2009
  • fDate
    16-19 March 2009
  • Firstpage
    257
  • Lastpage
    264
  • Abstract
    Virtual machine (VM) technology provides an additional layer of abstraction for resource management in high-performance computing (HPC) systems. In large-scale computing clusters, component failures become norms instead of exceptions, caused by the ever-increasing system complexity. VM construction and reconfiguration is a potent tool for efficient online system maintenance and failure resilience. In this paper, we study how VM-based HPC clusters benefit from failure prediction in resource management for dependable computing. We consider both the reliability and performance status of compute nodes in making selection decisions. We define a capacity-reliability metric to combine the effects of both factors, and propose the Best-fit algorithm to find the best qualified nodes on which to instantiate VMs to run user jobs. We have conducted experiments using failure traces from the Los Alamos National Laboratory (LANL) HPC clusters. The results show the enhancement of system dependability by using our proposed strategy with practically achievable accuracy of failure prediction. With the Best-fit strategies, the job completion rate is increased by 10.5% compared with that achieved in the current LANL HPC cluster. The task completion rate reaches 82.5% with improved utilization of relatively unreliable nodes.
  • Keywords
    resource allocation; software maintenance; software reliability; system recovery; virtual machines; workstation clusters; HPC clusters; VM construction; VM reconfiguration; best-fit algorithm; capacity-reliability metric; component failure; efficient online system maintenance; failure prediction; failure resilience; failure resilient high performance computing clusters; high-performance computing; job completion rate; large-scale computing cluster; proactive resource management; system complexity; system dependability; task completion rate; virtual machine; Clustering algorithms; High performance computing; Laboratories; Large-scale systems; Maintenance; Resilience; Resource management; Virtual machining; Virtual manufacturing; Voice mail; Dependable systems; Failure resilience; High performance computing; Proactive management;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Availability, Reliability and Security, 2009. ARES '09. International Conference on
  • Conference_Location
    Fukuoka
  • Print_ISBN
    978-1-4244-3572-2
  • Electronic_ISBN
    978-0-7695-3564-7
  • Type

    conf

  • DOI
    10.1109/ARES.2009.13
  • Filename
    5066481