Title :
Enhancing application robustness through adaptive fault tolerance
Author :
Lan, Zhiling ; Li, Yawei ; Zheng, Ziming ; Gujrati, Prashasta
Author_Institution :
Dept. of Comput. Sci., Illinois Inst. of Technol., Chicago, IL
Abstract :
As the scale of high performance computing (HPC) continues to grow, application fault resilience becomes crucial. To address this problem, we are working on the design of an adaptive fault tolerance system for HPC applications. It aims to enable parallel applications to avoid anticipated failures via preventive migration, and in the case of unforeseeable failures, to minimize their impact through selective checkpointing. Both prior and ongoing work are summarized in this paper.
Keywords :
checkpointing; parallel processing; software fault tolerance; adaptive fault tolerance; application robustness enhancement; high performance computing; parallel applications; preventive migration; selective checkpointing; Adaptive systems; Application software; Checkpointing; Computer science; Fault tolerance; Fault tolerant systems; High performance computing; Resilience; Robustness; Runtime
Conference_Title :
Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on
Conference_Location :
Miami, FL
Print_ISBN :
978-1-4244-1693-6
ISSN :
1530-2075
DOI :
10.1109/IPDPS.2008.4536383