DocumentCode :
1448186
Title :
An architecture for tolerating processor failures in shared-memory multiprocessors
Author :
Banâtre, Michel ; Gefflaut, Alain ; Joubert, Philippe ; Morin, Christine ; Lee, Peter A.
Author_Institution :
Campus Univ. de Beaulieu, IRISA, Rennes, France
Volume :
45
Issue :
10
fYear :
1996
fDate :
10/1/1996
Firstpage :
1101
Lastpage :
1115
Abstract :
This paper focuses on the problem of fault tolerance in shared memory multiprocessors, and describes an architecture designed for transparently tolerating processor failures. The Recoverable Shared Memory (RSM) is the novel component of this architecture, providing a hardware supported backward error recovery mechanism which minimizes the propagation of recovery when a processor fails. The RSM permits a shared memory multiprocessor to be constructed using standard caches and cache coherence protocols, and does not require any changes to be made to applications software. The performance of the recovery scheme supported by the RSM is evaluated and compared with other schemes that have been proposed for fault tolerant shared memory multiprocessors. The performance study has been conducted by simulation using address traces collected from real parallel applications.
Keywords :
fault tolerant computing; memory protocols; performance evaluation; shared memory systems; address traces; cache coherence protocols; fault tolerance; hardware supported backward error recovery mechanism; processor failures toleration; recoverable shared memory; shared-memory multiprocessors; simulation; Application software; Computational modeling; Computer architecture; Computer errors; Fault tolerance; Fault tolerant systems; Hardware; Multiprocessing systems; Operating systems; Protocols;
fLanguage :
English
Journal_Title :
Computers, IEEE Transactions on
Publisher :
IEEE
ISSN :
0018-9340
Type :
jour
DOI :
10.1109/12.543705
Filename :
543705