• DocumentCode
    3453847
  • Title

    Checkpointing and recovery of shared memory parallel applications in a cluster

  • Author

    Badrinath, R. ; Morin, Christine ; Vallee, Geoffroy

  • Author_Institution
    IRISA/INRIA, France
  • fYear
    2003
  • fDate
    12-15 May 2003
  • Firstpage
    471
  • Lastpage
    477
  • Abstract
    This paper describes issues in the design and implementation of checkpointing and recovery modules for the Kerrighed DSM cluster system. Our design is for a DSM supporting the sequential consistency model. The mechanisms are general enough to be used in a number of different checkpointing and recovery protocols. They are designed to support common performance optimizations suggested in the literature, while staying lightweight during fault-free execution. We also present preliminary performance results of the current implementation.
  • Keywords
    distributed shared memory systems; fault tolerant computing; protocols; system recovery; workstation clusters; Kerrighed DSM cluster system; checkpointing; cluster computing; fault tolerance; recovery protocol; shared memory parallel application; Checkpointing; Containers; Fault tolerant systems; Kernel; Linux; Memory management; Operating systems; Protocols; Random access memory; Research and development
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster Computing and the Grid, 2003. Proceedings. CCGrid 2003. 3rd IEEE/ACM International Symposium on
  • Print_ISBN
    0-7695-1919-9
  • Type

    conf

  • DOI
    10.1109/CCGRID.2003.1199403
  • Filename
    1199403