DocumentCode
3025330
Title
Extending a cluster SSI OS for transparently checkpointing message-passing parallel applications
Author
Fertre, Matthieu ; Morin, Christine
Author_Institution
IRISA, Rennes, France
fYear
2005
fDate
7-9 Dec. 2005
Abstract
Clusters are now widely used to execute scientific applications, which are often message-passing parallel applications with long execution times. As the number of nodes in a cluster grows, faults become more frequent, so an application's execution time may exceed the mean time before failure (MTBF) of the cluster. To avoid restarting an application from the beginning, it is desirable that cluster systems provide fault-tolerance mechanisms such as checkpoint/restart. One efficient way to implement this mechanism is to build it directly into the application or the communication library. Another approach is to implement it in an operating system dedicated to clusters. This is more complex, but it allows any message-passing application to be checkpointed and restarted regardless of the communication library it uses. This paper presents basic mechanisms for system-initiated checkpointing of message-passing parallel applications running on a cluster. Performance results are presented for a prototype implemented in KERRIGHED, a Single System Image cluster operating system based on LINUX.
Keywords
Linux; checkpointing; fault tolerant computing; message passing; parallel programming; workstation clusters; KERRIGHED Single System Image cluster Operating System; LINUX; checkpointing; communication library; fault tolerant mechanism; mean time before failure; message-passing parallel application; restart mechanism; scientific application; Application software; Checkpointing; Fault tolerant systems; Hardware; Joining processes; Linux; Operating systems; Protocols; Prototypes; Software libraries; checkpointing; global coordination; parallel application; single system image;
fLanguage
English
Publisher
ieee
Conference_Titel
Parallel Architectures, Algorithms and Networks, 2005. ISPAN 2005. Proceedings. 8th International Symposium on
ISSN
1087-4089
Print_ISBN
0-7695-2509-1
Type
conf
DOI
10.1109/ISPAN.2005.46
Filename
1575851