DocumentCode :
1465742
Title :
Algorithm-based fault location and recovery for matrix computations on multiprocessor systems
Author :
Roy-Chowdhury, Amber ; Banerjee, Prithviraj
Author_Institution :
IBM Thomas J. Watson Res. Center, Yorktown Heights, NY, USA
Volume :
45
Issue :
11
fYear :
1996
fDate :
11/1/1996
Firstpage :
1239
Lastpage :
1247
Abstract :
Algorithm-based fault-tolerance (ABFT) is an inexpensive method of incorporating fault-tolerance into existing applications. Applications are modified to operate on encoded data and produce encoded results which may then be checked for correctness. An attractive feature of the scheme is that it requires little or no modification to the underlying hardware or system software. Previous algorithm-based methods for developing reliable versions of numerical programs for general-purpose multicomputers have mostly concerned themselves with error detection. A truly fault-tolerant algorithm, however, needs to locate errors and recover from them once they are located. In a parallel processing environment, this corresponds to locating the faulty processors and recovering the data corrupted by the faulty processors. In this paper, we first present a general scheme for performing fault-location and recovery under the ABFT framework. Our fault model assumes that a faulty processor can corrupt all the data it possesses. The fault-location scheme is an application of system-level diagnosis theory to the ABFT framework, while the fault-recovery scheme uses ideas from coding theory to maintain redundant data and uses this to recover corrupted data in the event of processor failures. Results are presented on implementations of three numerical algorithms on a 16-processor Intel iPSC/2 hypercube multicomputer, which demonstrate acceptably low overheads for the single and double fault location and recovery cases.
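Illustrative note (not part of the original record): the abstract's idea of keeping redundant data so that a faulty processor's data can be rebuilt can be sketched with a plain sum checksum. This is a generic single-erasure code chosen for illustration, not the coding scheme of the paper. The four-processor layout, the function names `column_checksum` and `recover_block`, and the assumption that the faulty processor's index is already known are all hypothetical; the system-level diagnosis step that locates the fault is not modeled.

```python
def column_checksum(blocks):
    """Return the element-wise sum of equally sized blocks (the redundant block)."""
    return [sum(vals) for vals in zip(*blocks)]


def recover_block(blocks, checksum, faulty):
    """Rebuild the block held by processor `faulty` from the survivors and the checksum."""
    survivors = [b for i, b in enumerate(blocks) if i != faulty]
    partial = column_checksum(survivors) if survivors else [0] * len(checksum)
    return [c - p for c, p in zip(checksum, partial)]


# Hypothetical layout: four processors, each holding one row of integers.
blocks = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
checksum = column_checksum(blocks)  # assumed to live on a separate checksum processor

# Fault model from the abstract: a faulty processor may corrupt all the data it holds.
corrupted = [row[:] for row in blocks]
corrupted[2] = [0, 0, 0]

# Detection: the recomputed checksum no longer matches the stored one.
detected = column_checksum(corrupted) != checksum

# Recovery: the index 2 is assumed to come from a separate fault-location step.
restored = recover_block(corrupted, checksum, faulty=2)
```

A single sum checksum of this kind can rebuild only one lost block. Handling the double-fault case mentioned in the abstract would need additional independent redundant blocks, which this sketch does not attempt.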
Keywords :
distributed processing; matrix algebra; multiprocessing programs; multiprocessing systems; parallel algorithms; software fault tolerance; system recovery; Intel iPSC/2 hypercube multicomputer; algorithm-based fault location; algorithm-based fault recovery; algorithm-based methods; coding theory; correctness; fault location; fault recovery; fault-location; fault-tolerant algorithm; faulty processors; general-purpose multicomputers; matrix computations; multiprocessor systems; numerical algorithms; parallel numerical algorithms; system level diagnosis; system-level diagnosis theory; Application software; Computer errors; Error correction; Fault diagnosis; Fault location; Fault tolerance; Fault tolerant systems; Hardware; Multiprocessing systems; System software;
fLanguage :
English
Journal_Title :
Computers, IEEE Transactions on
Publisher :
IEEE
ISSN :
0018-9340
Type :
jour
DOI :
10.1109/12.544480
Filename :
544480