DocumentCode :
1686713
Title :
Enhancing application robustness through adaptive fault tolerance
Author :
Lan, Zhiling ; Li, Yawei ; Zheng, Ziming ; Gujrati, Prashasta
Author_Institution :
Dept. of Comput. Sci., Illinois Inst. of Technol., Chicago, IL
fYear :
2008
Firstpage :
1
Lastpage :
5
Abstract :
As the scale of high performance computing (HPC) continues to grow, application fault resilience becomes crucial. To address this problem, we are working on the design of an adaptive fault tolerance system for HPC applications. It aims to enable parallel applications to avoid anticipated failures via preventive migration, and in the case of unforeseeable failures, to minimize their impact through selective checkpointing. Both prior and ongoing work are summarized in this paper.
Keywords :
checkpointing; parallel processing; software fault tolerance; adaptive fault tolerance; application robustness enhancement; high performance computing; parallel applications; preventive migration; selective checkpointing; Adaptive systems; Application software; Checkpointing; Computer science; Fault tolerance; Fault tolerant systems; High performance computing; Resilience; Robustness; Runtime;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on
Conference_Location :
Miami, FL
ISSN :
1530-2075
Print_ISBN :
978-1-4244-1693-6
Electronic_ISBN :
1530-2075
Type :
conf
DOI :
10.1109/IPDPS.2008.4536383
Filename :
4536383