  • DocumentCode
    2979044
  • Title
    Fault-tolerance using cache-coherent distributed shared memory systems
  • Author
    Hecht, D.L.; Kavi, K.M.; Gaede, R.K.; Katsinis, C.
  • Author_Institution
    Alabama Univ., Huntsville, AL, USA
  • fYear
    1999
  • fDate
    1999
  • Firstpage
    100
  • Lastpage
    105
  • Abstract
    Describes new protocols that augment traditional cache-coherency mechanisms to implement fault tolerance based on recovery blocks and checkpointing. Concurrent processes complicate rollback recovery, because rolling back one process can force the processes it has communicated with to roll back as well, potentially producing a “domino effect” in which processes are rolled back to the beginning. Several approaches have been proposed to limit the domino effect. One set of such techniques requires communicating processes to synchronize periodically in order to checkpoint a globally consistent state. These schemes can be implemented more naturally on distributed shared memory systems by synchronizing through shared memory. We have developed extensions to well-known cache-coherency methods (e.g., directory-based protocols) for checkpointing consistent states.
  • Keywords
    cache storage; coherence; distributed shared memory systems; fault tolerant computing; memory protocols; synchronisation; system recovery; cache-coherent distributed shared memory systems; checkpointing; communicating process synchronization; concurrent processes; directory-based cache-coherency methods; domino effect; fault tolerance; globally consistent state; protocols; recovery blocks; rollback recovery; Decision support systems; Fault tolerant systems
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Parallel Architectures, Algorithms, and Networks, 1999 (I-SPAN '99). Proceedings. Fourth International Symposium on
  • Conference_Location
    Perth/Fremantle, WA
  • ISSN
    1087-4089
  • Print_ISBN
    0-7695-0231-8
  • Type
    conf
  • DOI
    10.1109/ISPAN.1999.778924
  • Filename
    778924