DocumentCode :
3047346
Title :
A hierarchical checkpointing protocol for parallel applications in cluster federations
Author :
Monnet, Sébastien ; Morin, Christine ; Badrinath, Ramamurthy
Author_Institution :
IRISA, Rennes, France
fYear :
2004
fDate :
26-30 April 2004
Firstpage :
211
Abstract :
Summary form only given. Code coupling applications can be divided into communicating modules that may be executed on different clusters in a cluster federation. As a cluster federation comprises a large number of nodes, there is a high probability of a node failure. We propose a hierarchical checkpointing protocol that combines a synchronized checkpointing technique inside clusters with a communication-induced technique between clusters. This protocol fits the characteristics of a cluster federation (a large number of nodes, and high-latency, low-bandwidth networking technologies between clusters). A preliminary performance evaluation using a discrete event simulator shows that the protocol is suitable for code coupling applications.
Keywords :
discrete event simulation; parallel processing; performance evaluation; protocols; system recovery; workstation clusters; cluster federations; code coupling applications; discrete event simulator; hierarchical checkpointing protocol; node failure; parallel applications; performance evaluation; Application software; Bandwidth; Checkpointing; Delay; Discrete event simulation; ISO standards; Local area networks; Performance evaluation; Protocols; Storage area networks;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International
Print_ISBN :
0-7695-2132-0
Type :
conf
DOI :
10.1109/IPDPS.2004.1303242
Filename :
1303242