DocumentCode
1448186
Title
An architecture for tolerating processor failures in shared-memory multiprocessors
Author
Banâtre, Michel ; Gefflaut, Alain ; Joubert, Philippe ; Morin, Christine ; Lee, Peter A.
Author_Institution
Campus Univ. de Beaulieu, IRISA, Rennes, France
Volume
45
Issue
10
fYear
1996
fDate
10/1/1996
Firstpage
1101
Lastpage
1115
Abstract
This paper focuses on the problem of fault tolerance in shared-memory multiprocessors and describes an architecture designed for transparently tolerating processor failures. The Recoverable Shared Memory (RSM) is the novel component of this architecture, providing a hardware-supported backward error recovery mechanism that minimizes the propagation of recovery when a processor fails. The RSM permits a shared-memory multiprocessor to be constructed using standard caches and cache coherence protocols, and does not require any changes to application software. The performance of the recovery scheme supported by the RSM is evaluated and compared with other schemes that have been proposed for fault-tolerant shared-memory multiprocessors. The performance study was conducted by simulation using address traces collected from real parallel applications.
Keywords
fault tolerant computing; memory protocols; performance evaluation; shared memory systems; address traces; cache coherence protocols; fault tolerance; hardware supported backward error recovery mechanism; processor failures toleration; recoverable shared memory; shared-memory multiprocessors; simulation; Application software; Computational modeling; Computer architecture; Computer errors; Fault tolerance; Fault tolerant systems; Hardware; Multiprocessing systems; Operating systems; Protocols
fLanguage
English
Journal_Title
Computers, IEEE Transactions on
Publisher
IEEE
ISSN
0018-9340
Type
jour
DOI
10.1109/12.543705
Filename
543705