DocumentCode
1439591
Title
Algorithm-based fault tolerance on a hypercube multiprocessor
Author
Banerjee, Prithviraj ; Rahmeh, Joe T. ; Stunkel, Craig ; Nair, V.S. ; Roy, Kaushik ; Balasubramanian, Vijay ; Abraham, Jacob A.
Author_Institution
Dept. of Electr. Eng., Illinois Univ., Urbana, IL, USA
Volume
39
Issue
9
fYear
1990
fDate
9/1/1990
Firstpage
1132
Lastpage
1145
Abstract
The design of a fault-tolerant hypercube multiprocessor architecture is discussed. The authors propose detecting and locating faulty processors concurrently with the execution of parallel applications on the hypercube, using a novel scheme of algorithm-based error detection. System-level error detection mechanisms have been implemented for three parallel applications on a 16-processor Intel iPSC hypercube multiprocessor: matrix multiplication, Gaussian elimination, and fast Fourier transform. Schemes for other applications are under development. The error coverage of the system-level error detection schemes has been studied extensively in the presence of finite-precision arithmetic, which affects the system-level encodings. Two reconfiguration schemes are proposed that allow the authors to isolate faulty processors and replace them with spare processors.
Keywords
fault tolerant computing; multiprocessing systems; parallel architectures; Gaussian elimination; Intel iPSC hypercube; error detection; fast Fourier transform; fault tolerance; faulty processors; hypercube multiprocessor; matrix multiplication; multiprocessor architecture; Computer architecture; Computer errors; Costs; Fault detection; Fault diagnosis; Fault tolerance; Hypercubes; Jacobian matrices; Joining processes; Parallel architectures
fLanguage
English
Journal_Title
Computers, IEEE Transactions on
Publisher
IEEE
ISSN
0018-9340
Type
jour
DOI
10.1109/12.57055
Filename
57055