DocumentCode
3453847
Title
Checkpointing and recovery of shared memory parallel applications in a cluster
Author
Badrinath, R. ; Morin, Christine ; Vallee, Geoffroy
Author_Institution
IRISA/INRIA, France
fYear
2003
fDate
12-15 May 2003
Firstpage
471
Lastpage
477
Abstract
This paper describes issues in the design and implementation of checkpointing and recovery modules for the Kerrighed DSM cluster system. Our design targets a DSM that supports the sequential consistency model. The mechanisms are general enough to be used in a number of different checkpointing and recovery protocols. They are designed to support common performance optimizations suggested in the literature, while remaining lightweight during fault-free execution. We also present preliminary performance results for the current implementation.
Keywords
distributed shared memory systems; fault tolerant computing; protocols; system recovery; workstation clusters; Kerrighed DSM cluster system; checkpointing; cluster computing; fault tolerance; recovery protocol; shared memory parallel application; Checkpointing; Containers; Fault tolerant systems; Kernel; Linux; Memory management; Operating systems; Protocols; Random access memory; Research and development;
fLanguage
English
Publisher
IEEE
Conference_Titel
Cluster Computing and the Grid, 2003. Proceedings. CCGrid 2003. 3rd IEEE/ACM International Symposium on
Print_ISBN
0-7695-1919-9
Type
conf
DOI
10.1109/CCGRID.2003.1199403
Filename
1199403