DocumentCode :
1998414
Title :
Sustained Resilience via Live Process Cloning
Author :
Rezaei, A.; Mueller, Frank
Author_Institution :
Dept. of Comput. Sci., North Carolina State Univ., Raleigh, NC, USA
fYear :
2013
fDate :
20-24 May 2013
Firstpage :
1498
Lastpage :
1507
Abstract :
The next generation of supercomputers, which rely on massive numbers of computational elements, requires more flexible fault-tolerance approaches with lower overhead. This work proposes a reactive method for fault resilience in high-performance computing (HPC) systems based on forward execution instead of rollback to checkpoints. We study the feasibility of combining redundancy with live process cloning to create highly reliable HPC systems. The main motivation is to avoid costly checkpoint/restart approaches. We present live process cloning as a mechanism to create a copy of a running process on the fly. We show that the reliability of a dual redundant system with live process cloning is as good as that of a triple redundant system, even for very large systems. We also investigate the effect of node failures on the application's Mean Time To Interrupt (MTTI). This provides a better understanding of the time available to recover from a failure by cloning a healthy replica.
Keywords :
checkpointing; fault tolerant computing; parallel processing; redundancy; reliability; HPC systems; MTTI; checkpoint restart approach; computational elements; dual redundant system; fault resilience; flexible fault tolerance approaches; high-performance computing systems; live process cloning; mean time to interrupt; next generation supercomputers; sustained resilience; triple redundant system; Checkpointing; Cloning; Computational modeling; Logic gates; Redundancy; Resilience; Fault Resilience; HPC; Process Cloning;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International
Conference_Location :
Cambridge, MA
Print_ISBN :
978-0-7695-4979-8
Type :
conf
DOI :
10.1109/IPDPSW.2013.224
Filename :
6651044