DocumentCode
3047346
Title
A hierarchical checkpointing protocol for parallel applications in cluster federations
Author
Monnet, Sébastien ; Morin, Christine ; Badrinath, Ramamurthy
Author_Institution
IRISA, Rennes, France
fYear
2004
fDate
26-30 April 2004
Firstpage
211
Abstract
Summary form only given. Code coupling applications can be divided into communicating modules that may be executed on different clusters in a cluster federation. Because a cluster federation comprises a large number of nodes, the probability of a node failure is high. We propose a hierarchical checkpointing protocol that combines a synchronized checkpointing technique inside clusters with a communication-induced technique between clusters. This protocol fits the characteristics of a cluster federation (a large number of nodes, and high-latency, low-bandwidth networking technologies between clusters). A preliminary performance evaluation using a discrete event simulator shows that the protocol is suitable for code coupling applications.
Keywords
discrete event simulation; parallel processing; performance evaluation; protocols; system recovery; workstation clusters; cluster federations; code coupling applications; discrete event simulator; hierarchical checkpointing protocol; node failure; parallel applications; performance evaluation; Application software; Bandwidth; Checkpointing; Delay; Discrete event simulation; ISO standards; Local area networks; Performance evaluation; Protocols; Storage area networks;
fLanguage
English
Publisher
IEEE
Conference_Titel
Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International
Print_ISBN
0-7695-2132-0
Type
conf
DOI
10.1109/IPDPS.2004.1303242
Filename
1303242