DocumentCode
3248243
Title
Environmental-aware optimization of MPI checkpointing intervals
Author
Jitsumoto, Hideyuki ; Endo, Toshio ; Matsuoka, Satoshi
Author_Institution
Tokyo Inst. of Technol., Tokyo
fYear
2008
fDate
Sept. 29 2008-Oct. 1 2008
Firstpage
326
Lastpage
329
Abstract
Fault-tolerance for HPC systems with long-running applications of massive and growing scale is now essential. Although checkpointing with rollback recovery is a popular technique, automated checkpointing is becoming troublesome in a real system, due to the extremely large size of collective application memory. Therefore, automated optimization of the checkpoint interval is essential, but the optimal point depends on hardware failure rates and I/O bandwidth. Our new model and an algorithm, which is an extension of Vaidya's model, solve the problem by taking such parameters into account. Prototype implementation on our fault-tolerant MPI framework ABARIS showed approximately 5.5% improvement over statically user-determined cases.
Keywords
checkpointing; fault tolerant computing; message passing; optimisation; HPC systems; MPI checkpointing intervals; Vaidya model; collective application memory; environmental-aware optimization; fault-tolerance; rollback recovery; Bandwidth; Checkpointing; Cost function; Design optimization; Exponential distribution; Fault tolerance; Fault tolerant systems; Informatics; Prototypes; Supercomputers;
fLanguage
English
Publisher
IEEE
Conference_Titel
Cluster Computing, 2008 IEEE International Conference on
Conference_Location
Tsukuba
ISSN
1552-5244
Print_ISBN
978-1-4244-2639-3
Electronic_ISBN
1552-5244
Type
conf
DOI
10.1109/CLUSTR.2008.4663790
Filename
4663790