• DocumentCode
    1448186
  • Title

    An architecture for tolerating processor failures in shared-memory multiprocessors

  • Author

    Banâtre, Michel ; Gefflaut, Alain ; Joubert, Philippe ; Morin, Christine ; Lee, Peter A.

  • Author_Institution
    Campus Univ. de Beaulieu, IRISA, Rennes, France
  • Volume
    45
  • Issue
    10
  • fYear
    1996
  • fDate
    10/1/1996 12:00:00 AM
  • Firstpage
    1101
  • Lastpage
    1115
  • Abstract
    This paper focuses on the problem of fault tolerance in shared-memory multiprocessors, and describes an architecture designed for transparently tolerating processor failures. The Recoverable Shared Memory (RSM) is the novel component of this architecture, providing a hardware-supported backward error recovery mechanism which minimizes the propagation of recovery when a processor fails. The RSM permits a shared-memory multiprocessor to be constructed using standard caches and cache coherence protocols, and does not require any changes to be made to application software. The performance of the recovery scheme supported by the RSM is evaluated and compared with other schemes that have been proposed for fault-tolerant shared-memory multiprocessors. The performance study has been conducted by simulation using address traces collected from real parallel applications.
  • Keywords
    fault tolerant computing; memory protocols; performance evaluation; shared memory systems; address traces; cache coherence protocols; fault tolerance; hardware supported backward error recovery mechanism; processor failures toleration; recoverable shared memory; shared-memory multiprocessors; simulation; Application software; Computational modeling; Computer architecture; Computer errors; Fault tolerance; Fault tolerant systems; Hardware; Multiprocessing systems; Operating systems; Protocols
  • fLanguage
    English
  • Journal_Title
    Computers, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0018-9340
  • Type

    jour

  • DOI
    10.1109/12.543705
  • Filename
    543705