DocumentCode
1236762
Title
Adaptive Fault Management of Parallel Applications for High-Performance Computing
Author
Lan, Zhiling ; Li, Yawei
Author_Institution
Dept. of Comput. Sci., Illinois Inst. of Technol., Chicago, IL
Volume
57
Issue
12
fYear
2008
Firstpage
1647
Lastpage
1660
Abstract
As the scale of high-performance computing (HPC) continues to grow, failure resilience of parallel applications becomes crucial. In this paper, we present FT-Pro, an adaptive fault management approach that combines proactive migration with reactive checkpointing. It aims to enable parallel applications to avoid anticipated failures via preventive migration and, in the case of unforeseeable failures, to minimize their impact through selective checkpointing. An adaptation manager is designed to make runtime decisions in response to failure prediction. Extensive experiments, by means of stochastic modeling and case studies with real applications, indicate that FT-Pro outperforms periodic checkpointing, in terms of reducing application completion times and improving resource utilization, by up to 43 percent.
Keywords
checkpointing; parallel machines; performance evaluation; adaptive fault management; high-performance computing; periodic checkpointing; preventive migration; proactive migration; reactive checkpointing; resource utilization; stochastic modeling; Application software; Checkpointing; Computer Society; Computer applications; Concurrent computing; Power engineering computing; Resilience; Resource management; Runtime; Stochastic processes; Fault tolerance; Performance evaluation of algorithms and systems;
fLanguage
English
Journal_Title
Computers, IEEE Transactions on
Publisher
IEEE
ISSN
0018-9340
Type
jour
DOI
10.1109/TC.2008.90
Filename
4531733