DocumentCode :
1523906
Title :
Graceful degradation in algorithm-based fault tolerant multiprocessor systems
Author :
Yajnik, Shalini ; Jha, Niraj K.
Author_Institution :
Lucent Technol., AT&T Bell Labs., Murray Hill, NJ, USA
Volume :
8
Issue :
2
fYear :
1997
fDate :
2/1/1997
Firstpage :
137
Lastpage :
153
Abstract :
Algorithm-based fault tolerance (ABFT) is a technique that improves the reliability of a multiprocessor system by providing it with concurrent error detection and fault location capability. It encodes data at the system level and modifies the algorithm to operate on the encoded data in order to expose both transient and permanent faults in any processor. Work done so far in this area addresses only the fault detection and location part of the problem. However, if spare processors are not available, then after a faulty processor has been located, the work initially assigned to it has to be mapped to some nonfaulty processors in the system in such a way that the fault tolerance capability of the system is still maintained with as small a degradation in performance as possible. In this paper, we propose an integrated deterministic solution to the above problem which combines concurrent error detection and fault location with graceful degradation. No previous deterministic ABFT method exists for the design of general t-fault locating systems, even for the case of t=1. We propose a general method for designing one-fault locating/s-fault detecting systems. We use an extended model for representing ABFT systems. This model considers the processors computing the checks to be a part of the ABFT system, so that faults in the check computing processors can also be detected and located using a simple diagnosis algorithm, and the checks can be mapped to other nonfaulty processors in the system.
Keywords :
error detection; fault location; fault tolerant computing; multiprocessing systems; performance evaluation; algorithm-based fault tolerant multiprocessor systems; concurrent error detection; data encoding; diagnosis algorithm; fault location; graceful degradation; integrated deterministic solution; reliability; Concurrent computing; Degradation; Design methodology; Fault detection; Fault diagnosis; Fault location; Fault tolerance; Fault tolerant systems; Multiprocessing systems; Signal processing algorithms;
fLanguage :
English
Journal_Title :
Parallel and Distributed Systems, IEEE Transactions on
Publisher :
IEEE
ISSN :
1045-9219
Type :
jour
DOI :
10.1109/71.577256
Filename :
577256