DocumentCode :
1684082
Title :
An optimal checkpoint/restart model for a large scale high performance computing system
Author :
Yudan Liu ; Nassar, R. ; Leangsuksun, C. ; Naksinehaboon, N. ; Paun, M. ; Scott, S.L.
Author_Institution :
Coll. of Eng. & Sci., Louisiana Tech Univ., Ruston, LA
fYear :
2008
Firstpage :
1
Lastpage :
9
Abstract :
The increase in the physical size of high performance computing (HPC) platforms makes system reliability more challenging. To minimize the performance loss (rollback and checkpoint overheads) caused by unexpected failures or by the unnecessary overhead of fault-tolerant mechanisms, we present a reliability-aware method for an optimal checkpoint/restart strategy. Our scheme aims to address the fault tolerance challenge, especially in large-scale HPC systems, by providing optimal checkpoint placement techniques that are derived from the actual system reliability. Unlike existing checkpoint models, which can only handle Poisson failures and a constant checkpoint interval, our model can deal with a varying checkpoint interval and with different failure distributions. In addition, the approach considers optimality for both checkpoint overhead and rollback time. Our validation results suggest a significant improvement over existing techniques.
Keywords :
checkpointing; fault tolerant computing; parallel processing; checkpoint interval; failure distributions; fault tolerant mechanisms; high performance computing; large scale HPC system; optimal checkpoint-restart model; performance loss; reliability-aware method; rollback time; Checkpointing; Cost function; Data analysis; Fault tolerant systems; High performance computing; Large-scale systems; Mathematical model; Mathematics; Reliability; Stochastic processes; HPC; Large-scale distributed system events log analysis; checkpoint/restart model; fault-tolerance; reliability;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on
Conference_Location :
Miami, FL
ISSN :
1530-2075
Print_ISBN :
978-1-4244-1693-6
Type :
conf
DOI :
10.1109/IPDPS.2008.4536279
Filename :
4536279