• DocumentCode
    1983949
  • Title

    Coherence-based coordinated checkpointing for software distributed shared memory systems

  • Author

    Kongmunvattana, Angkul ; Tanchatchawal, Santipong ; Tzeng, Nian-Feng

  • Author_Institution
    Center for Adv. Comput. Studies, Louisiana Univ., Lafayette, LA, USA
  • fYear
    2000
  • fDate
    2000
  • Firstpage
    556
  • Lastpage
    563
  • Abstract
    Fault-tolerant techniques that can cope with system failures in software distributed shared memory (SDSM) are essential for creating productive and highly available parallel computing environments on clusters of workstations. We propose a new, efficient coordinated checkpointing technique, called coherence-based coordinated checkpointing (CCC), for SDSM. Our CCC minimizes both the checkpointing overhead during failure-free execution and the cost of recovery from failures by leveraging existing coherence information maintained by SDSM. In the presence of system failures, it allows SDSM to recover from the most recent checkpoint, thereby saving re-computation time. We have performed experiments on a cluster of eight Sun Ultra-5 workstations, comparing our CCC technique against both simple coordinated checkpointing (SCC) and incremental coordinated checkpointing (ICC) techniques by implementing all three in TreadMarks, a state-of-the-art SDSM system. The experimental results demonstrate that our CCC technique consistently outperforms both SCC and ICC techniques. In particular, our technique increases execution time by only 0.5% to 4% for a 2-minute checkpointing interval during failure-free execution, while the SCC and ICC techniques incur execution time overheads of 4% to 100% and 3% to 64%, respectively, for the same checkpointing interval.
  • Keywords
    distributed shared memory systems; software fault tolerance; software performance evaluation; system recovery; workstation clusters; Sun Ultra-5 workstations; TreadMarks; coherence-based coordinated checkpointing; execution time; experiments; failure recovery; fault-tolerant techniques; incremental coordinated checkpointing; parallel computing; simple coordinated checkpointing; software distributed shared memory systems; system failures; Checkpointing; Costs; Distributed computing; Parallel programming; Protection; Protocols; Read only memory; Software systems; Space technology; Workstations
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Distributed Computing Systems, 2000. Proceedings. 20th International Conference on
  • Conference_Location
    Taipei
  • ISSN
    1063-6927
  • Print_ISBN
    0-7695-0601-1
  • Type

    conf

  • DOI
    10.1109/ICDCS.2000.840970
  • Filename
    840970