DocumentCode :
3145951
Title :
Algorithm-Based Recovery for Newton's Method without Checkpointing
Author :
Liu, Hui ; Davies, Teresa ; Ding, Chong ; Karlsson, Christer ; Chen, Zizhong
Author_Institution :
Dept. of Math. & Comput. Sci., Colorado Sch. of Mines, Golden, CO, USA
fYear :
2011
fDate :
16-20 May 2011
Firstpage :
1541
Lastpage :
1548
Abstract :
Checkpointing is the most popular fault tolerance method used in high-performance computing (HPC) systems. However, increasing failure rates require more frequent checkpoints, which makes checkpointing more expensive. We present a checkpoint-free fault tolerance technique. It takes advantage of both data dependencies and communication-induced redundancies of parallel applications to tolerate fail-stop failures. Under the specified conditions, our technique introduces no additional overhead when there is no actual failure in the computation and recovers the lost data with low overhead. We add fault-tolerant capability to Newton's method by using our scheme and diskless checkpointing. Numerical simulations indicate that our scheme introduces much less overhead than diskless checkpointing does.
Keywords :
Newton method; fault tolerance; parallel processing; system recovery; HPC system; algorithm-based recovery; checkpoint-free fault tolerance technique; communication-induced redundancy; data dependencies; diskless checkpointing; failure rate; high-performance computing; Checkpointing; Computers; Fault tolerant systems; Jacobian matrices; Nonlinear systems; Redundancy
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011 IEEE International Symposium on
Conference_Location :
Shanghai
ISSN :
1530-2075
Print_ISBN :
978-1-61284-425-1
Electronic_ISBN :
1530-2075
Type :
conf
DOI :
10.1109/IPDPS.2011.309
Filename :
6009013