Regional Information Center for Science and Technology - Highly scalable checkpointing for exascale computing

DocumentCode :

2449112

Title :

Highly scalable checkpointing for exascale computing

Author :

Karlsson, Christer ; Chen, Zizhong

Author_Institution :

Colorado Sch. of Mines, Golden, CO, USA

fYear :

2010

fDate :

19-23 April 2010

Firstpage :

Lastpage :

Abstract :

As the number of processors in High Performance Computers (HPC) continues to increase, the Mean-Time-To-Failure (T_MTTF) is becoming shorter than the expected execution time for many next-generation HPC applications. Most architectures can handle a failure without a system-wide breakdown, but many applications have no built-in ability to survive node failures. This paper presents an approach to developing a highly scalable technique that allows next-generation applications to survive node and/or link failures without aborting the computation. We will develop several strategies to improve the scalability of diskless checkpointing. The technique is scalable in the sense that, as the number of processes increases, the overhead to handle k failures on p processes should remain as constant as possible. We will present the proposed technique and initial results, together with remaining objectives and challenges.

Keywords :

checkpointing; HPC; diskless checkpointing; exascale computing; high performance computer; highly scalable checkpointing; mean-time-to-failure; Application software; Checkpointing; Computational modeling; Computer architecture; Delay; Earthquakes; Encoding; High performance computing; Large-scale systems; Scalability; exascale; multi failure; topology aware

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010 IEEE International Symposium on

Conference_Location :

Atlanta, GA

Print_ISBN :

978-1-4244-6533-0

Type :

conf

DOI :

10.1109/IPDPSW.2010.5470810

Filename :

5470810

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2449112