DocumentCode :
2541020
Title :
An evaluation of system-level fault tolerance on the Intel hypercube multiprocessor
Author :
Banerjee, P. ; Rahmeh, J.T. ; Stunkel, C.B. ; Nair, V.S.S. ; Roy, K. ; Abraham, J.A.
Author_Institution :
Coordinated Sci. Lab., Illinois Univ., Urbana, IL, USA
fYear :
1988
fDate :
27-30 June 1988
Firstpage :
362
Lastpage :
367
Abstract :
A discussion is presented of a fault-tolerant hypercube multiprocessor architecture which uses a novel algorithm-based fault-detection approach for identifying faulty processors. The scheme involves the detection and location of faulty processors concurrently with the actual execution of parallel applications on the hypercube. The authors have implemented system-level fault-detection mechanisms for various parallel applications on a 16-processor Intel iPSC hypercube multiprocessor. They report on the results of two applications: matrix multiplication and fast Fourier transform. They have performed extensive studies of fault coverage of their system-level fault-detection schemes in the presence of finite-precision arithmetic, which affects the system-level encodings. They propose a reconfiguration strategy for reconfiguring the system around faulty processors by introducing spare links and nodes.
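The abstract does not say which encoding the authors use. As an illustration only, the sketch below assumes the widely known checksum encoding for matrix multiplication: a checksum row is appended to one operand and a checksum column to the other, and the product is then checked against its own checksums. The function names and tolerance values are hypothetical and not taken from the paper. The tolerance in the comparison stands in for the finite-precision concern the abstract raises, since rounding can make a fault-free checksum differ slightly from the recomputed sum.

```python
import math

def with_column_checksum(a):
    """Return a copy of a with one extra row holding each column's sum."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]

def with_row_checksum(b):
    """Return a copy of b with one extra entry per row holding that row's sum."""
    return [row + [sum(row)] for row in b]

def matmul(a, b):
    """Plain triple-loop matrix product on lists of lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def locate_faults(c_full, rel_tol=1e-9, abs_tol=1e-12):
    """Compare recomputed row/column sums with the stored checksums.

    Returns (bad_rows, bad_cols). The tolerances are assumed values:
    too tight gives false alarms from rounding, too loose hides small errors.
    """
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [
        i for i in range(n)
        if not math.isclose(sum(c_full[i][:m]), c_full[i][m],
                            rel_tol=rel_tol, abs_tol=abs_tol)
    ]
    bad_cols = [
        j for j in range(m)
        if not math.isclose(sum(c_full[i][j] for i in range(n)), c_full[n][j],
                            rel_tol=rel_tol, abs_tol=abs_tol)
    ]
    return bad_rows, bad_cols

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[0.5, 1.5], [2.5, 3.5]]
c_full = matmul(with_column_checksum(a), with_row_checksum(b))
clean = locate_faults(c_full)

c_full[1][0] += 1.0  # simulate a processor producing one wrong element
faulty = locate_faults(c_full)
```

In this sketch a single corrupted element shows up as one inconsistent row and one inconsistent column, and their intersection points to the faulty element. On a multiprocessor that position would map to the processor that computed the corresponding block.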
Keywords :
fast Fourier transforms; fault tolerant computing; matrix algebra; multiprocessor interconnection networks; parallel architectures; 16 bit; 16-processor Intel iPSC hypercube multiprocessor; Intel hypercube multiprocessor; algorithm-based fault-detection; fast Fourier transform; fault coverage; fault-tolerant hypercube multiprocessor architecture; finite-precision arithmetic; matrix multiplication; reconfiguration strategy; spare links; system-level encodings; system-level fault tolerance; system-level fault-detection mechanisms; system-level fault-detection schemes; Algorithm design and analysis; Computer architecture; Costs; Fault detection; Fault diagnosis; Fault tolerance; Fault tolerant systems; Hypercubes; Logic; Testing;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Fault-Tolerant Computing, 1988. FTCS-18, Digest of Papers., Eighteenth International Symposium on
Conference_Location :
Tokyo, Japan
Print_ISBN :
0-8186-0867-6
Type :
conf
DOI :
10.1109/FTCS.1988.5344
Filename :
5344