Title :
Hybrid checkpointing for parallel applications in cluster federations
Author :
Monnet, Sébastien ; Morin, Christine ; Badrinath, Ramamurthy
Author_Institution :
IRISA, Rennes, France
Abstract :
Cluster federations are attractive for executing applications such as large-scale code coupling. However, faults may occur frequently in such architectures. Checkpointing long-running applications is therefore desirable, to avoid restarting them from the beginning when a node fails. To take into account the constraints of a cluster federation architecture, a hybrid checkpointing protocol is proposed. It uses global coordinated checkpointing inside clusters but only quasi-synchronous checkpointing techniques between clusters. The proposed protocol has been evaluated by simulation and is well suited to applications that can be divided into modules with heavy communication within modules but little between them.
Keywords :
performance evaluation; protocols; system recovery; workstation clusters; cluster federation architecture; global coordinated checkpointing; hybrid checkpointing protocol; large scale code coupling; long-running applications; node failure; parallel applications; quasi-synchronous checkpointing; Checkpointing; Computational modeling; Computer architecture; Hardware; Large-scale systems; Local area networks; Protocols; Security; Storage area networks; Wide area networks;
Conference_Title :
Cluster Computing and the Grid, 2004. CCGrid 2004. IEEE International Symposium on
Print_ISBN :
0-7803-8430-X
DOI :
10.1109/CCGrid.2004.1336712