Title :
N-Level Diskless Checkpointing
Author :
Hakkarinen, Doug; Chen, Zizhong
Author_Institution :
Dept. of Math. & Comput. Sci., Colorado Sch. of Mines, Golden, CO, USA
Abstract :
Diskless checkpointing is an efficient technique for tolerating a small number of processor failures in large parallel and distributed systems. In the literature, the simultaneous failure of no more than N processors is often tolerated by using a one-level Reed-Solomon checkpointing scheme for N simultaneous processor failures, whose overhead often increases quickly as N increases. In this paper, we study an N-level diskless checkpointing scheme to reduce the overhead of tolerating the simultaneous failure of no more than N processors by layering the schemes for the simultaneous failure of i processors, where i = 1, 2, ..., N. Simulation results indicate that the proposed N-level diskless checkpointing scheme achieves lower fault tolerance overhead than the one-level Reed-Solomon checkpointing scheme for N simultaneous processor failures.
Keywords :
checkpointing; parallel processing; software fault tolerance; N-level diskless checkpointing; Reed-Solomon checkpointing; distributed system; fault tolerance; large parallel system; simultaneous processor failure; Checkpointing; Computational modeling; Computer aided manufacturing; Concurrent computing; Contracts; Distributed computing; Fault tolerance; High performance computing; Reed-Solomon codes; USA Councils; checkpoint; diskless checkpointing; high performance computing; parallel and distributed systems
Conference_Title :
High Performance Computing and Communications, 2009. HPCC '09. 11th IEEE International Conference on
Conference_Location :
Seoul
Print_ISBN :
978-1-4244-4600-1
Electronic_ISBN :
978-0-7695-3738-2
DOI :
10.1109/HPCC.2009.55