DocumentCode
2405606
Title
Coordinated checkpoint versus message log for fault tolerant MPI
Author
Bouteiller, Aurélien ; Lemarinier, Pierre ; Krawezik, Géraud ; Capello
Author_Institution
LRI, Univ. de Paris Sud, France
fYear
2003
fDate
1-4 Dec. 2003
Firstpage
242
Lastpage
250
Abstract
MPI is one of the most widely adopted programming models for large clusters and grid deployments. However, these systems often suffer from network or node failures, which raises the issue of selecting a fault tolerance approach for MPI. Automatic and transparent approaches are based on either coordinated checkpointing or message logging combined with uncoordinated checkpointing. There are many protocols, implementations and optimizations for these approaches, but few results comparing them. Coordinated checkpointing has the advantage of a very low overhead on fault-free executions. In contrast, a message logging protocol systematically adds a significant message transfer penalty. The drawbacks of coordinated checkpointing come from its synchronization cost at checkpoint and restart times. In this paper we implement, evaluate and compare the two kinds of protocols, with special emphasis on their respective performance as a function of fault frequency. The main conclusion (under our experimental conditions) is that, for a large-scale cluster running applications with large datasets, message logging becomes relevant at fault frequencies of one fault per hour or higher.
Keywords
distributed programming; fault tolerant computing; grid computing; message passing; performance evaluation; system recovery; workstation clusters; PC clusters; coordinated checkpoint; coordinated checkpointing; fault free executions; fault frequency; fault tolerant MPI; message log; message logging protocol; message transfer penalty; network failures; node failures; programming models; restart times; synchronization cost; Checkpointing; Clouds; Computer fault tolerance; Costs; Electronic mail; Fault tolerance; Frequency synchronization; High performance computing; Large-scale systems; Message passing; Protocols; System recovery
fLanguage
English
Publisher
IEEE
Conference_Titel
Cluster Computing, 2003. Proceedings. 2003 IEEE International Conference on
Print_ISBN
0-7695-2066-9
Type
conf
DOI
10.1109/CLUSTR.2003.1253321
Filename
1253321