DocumentCode :
1684587
Title :
Scalable group-based checkpoint/restart for large-scale message-passing systems
Author :
Ho, Justin C Y ; Wang, Cho-Li ; Lau, Francis C M
Author_Institution :
Dept. of Comput. Sci., Univ. of Hong Kong, Hong Kong
fYear :
2008
Firstpage :
1
Lastpage :
12
Abstract :
The ever-increasing number of processors used in parallel computers is making fault tolerance support in large-scale parallel systems more and more important. We discuss the inadequacies of existing system-level checkpointing solutions for message-passing applications as the system scales up. We analyze the coordination cost and blocking behavior of two current MPI implementations with checkpointing support. A group-based solution combining coordinated checkpointing and message logging is then proposed. Experimental results demonstrate that it performs and scales better than LAM/MPI and MPICH-VCL. To assist group formation, a method to analyze the communication behavior of the application is proposed.
Keywords :
checkpointing; message passing; software fault tolerance; MPI; fault tolerance; large-scale message-passing systems; message logging; parallel computers; scalable group-based checkpoint-restart; system-level checkpointing; Application software; Checkpointing; Computer science; Concurrent computing; Costs; Failure analysis; Fault tolerance; Fault tolerant systems; Large-scale systems; Scalability;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on
Conference_Location :
Miami, FL
ISSN :
1530-2075
Print_ISBN :
978-1-4244-1693-6
Electronic_ISBN :
1530-2075
Type :
conf
DOI :
10.1109/IPDPS.2008.4536302
Filename :
4536302