Title :
Highly scalable checkpointing for exascale computing
Author :
Karlsson, Christer ; Chen, Zizhong
Author_Institution :
Colorado Sch. of Mines, Golden, CO, USA
Abstract :
One consequence of the continuing increase in the number of processors in high performance computing (HPC) systems shows up in the relationship between the mean time to failure (T_MTTF) and application execution time: the T_MTTF is becoming shorter than the expected execution time of many next-generation HPC applications. Most architectures can handle a failure without a system-wide breakdown, but many applications have no built-in ability to survive node failures. This paper presents an approach to developing a highly scalable technique that allows next-generation applications to survive node and/or link failures without aborting the computation. We will develop several strategies to improve the scalability of diskless checkpointing. The technique is scalable in the sense that, as the number of processes increases, the overhead of handling k failures on p processes should remain as close to constant as possible. We present the proposed technique and initial results, together with the remaining objectives and challenges.
Keywords :
checkpointing; HPC; diskless checkpointing; exascale computing; high performance computer; highly scalable checkpointing; mean-time-to-failure; Application software; Computational modeling; Computer architecture; Delay; Earthquakes; Encoding; High performance computing; Large-scale systems; Scalability; exascale; multi failure; topology aware
Conference_Title :
Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on
Conference_Location :
Atlanta, GA
Print_ISBN :
978-1-4244-6533-0
DOI :
10.1109/IPDPSW.2010.5470810