  • DocumentCode
    1236762
  • Title
    Adaptive Fault Management of Parallel Applications for High-Performance Computing
  • Author
    Lan, Zhiling; Li, Yawei
  • Author_Institution
    Dept. of Comput. Sci., Illinois Inst. of Technol., Chicago, IL
  • Volume
    57
  • Issue
    12
  • fYear
    2008
  • Firstpage
    1647
  • Lastpage
    1660
  • Abstract
    As the scale of high-performance computing (HPC) continues to grow, failure resilience of parallel applications becomes crucial. In this paper, we present FT-Pro, an adaptive fault management approach that combines proactive migration with reactive checkpointing. It aims to enable parallel applications to avoid anticipated failures via preventive migration and, in the case of unforeseeable failures, to minimize their impact through selective checkpointing. An adaptation manager is designed to make runtime decisions in response to failure prediction. Extensive experiments, by means of stochastic modeling and case studies with real applications, indicate that FT-Pro outperforms periodic checkpointing, in terms of reducing application completion times and improving resource utilization, by up to 43 percent.
  • Keywords
    checkpointing; parallel machines; performance evaluation; adaptive fault management; high-performance computing; periodic checkpointing; preventive migration; proactive migration; reactive checkpointing; resource utilization; stochastic modeling; Application software; Checkpointing; Computer Society; Computer applications; Concurrent computing; Power engineering computing; Resilience; Resource management; Runtime; Stochastic processes; Fault tolerance; Performance evaluation of algorithms and systems
  • fLanguage
    English
  • Journal_Title
    Computers, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0018-9340
  • Type
    jour
  • DOI
    10.1109/TC.2008.90
  • Filename
    4531733