DocumentCode :
2121
Title :
Multilevel Diskless Checkpointing
Author :
Hakkarinen, Doug ; Chen, Zizhong
Author_Institution :
Dept. of Electr. Eng. & Comput. Sci., Colorado Sch. of Mines, Golden, CO, USA
Volume :
62
Issue :
4
fYear :
2013
fDate :
Apr-13
Firstpage :
772
Lastpage :
783
Abstract :
Extreme scale systems available before the end of this decade are expected to have 100 million to 1 billion CPU cores. The probability that a failure occurs during an application execution is expected to be much higher than in today's systems. Counteracting this higher failure rate may require a combination of disk-based checkpointing, diskless checkpointing, and algorithmic fault tolerance. Diskless checkpointing is an efficient technique to tolerate a small number of process failures in large parallel and distributed systems. In the literature, a simultaneous failure of no more than N processes is often tolerated by using a one-level Reed-Solomon checkpointing scheme for N simultaneous process failures, whose overhead often increases quickly as N increases. We introduce an N-level diskless checkpointing scheme that reduces the overhead for tolerating a simultaneous failure of up to N processes. Each level is a diskless checkpointing scheme for a simultaneous failure of i processes, where i = 1, 2, ..., N. Simulation results indicate the proposed N-level diskless checkpointing scheme achieves lower fault tolerance overhead than the one-level Reed-Solomon checkpointing scheme for N simultaneous processor failures.
Keywords :
checkpointing; parallel processing; software fault tolerance; algorithmic fault tolerance; disk-based checkpointing; distributed systems; extreme scale systems; multilevel diskless checkpointing; one-level Reed-Solomon checkpointing scheme; parallel systems; Checkpointing; Encoding; Fault tolerance; Fault tolerant systems; Reed-Solomon codes; Runtime; Schedules; Extreme scale systems; checkpoint; diskless checkpointing; fault tolerance; high-performance computing;
fLanguage :
English
Journal_Title :
Computers, IEEE Transactions on
Publisher :
IEEE
ISSN :
0018-9340
Type :
jour
DOI :
10.1109/TC.2012.17
Filename :
6127862