• DocumentCode
    2541020
  • Title
    An evaluation of system-level fault tolerance on the Intel hypercube multiprocessor
  • Author
    Banerjee, P. ; Rahmeh, J.T. ; Stunkel, C.B. ; Nair, V.S.S. ; Roy, K. ; Abraham, J.A.
  • Author_Institution
    Coordinated Sci. Lab., Illinois Univ., Urbana, IL, USA
  • fYear
    1988
  • fDate
    27-30 June 1988
  • Firstpage
    362
  • Lastpage
    367
  • Abstract
    A discussion is presented of a fault-tolerant hypercube multiprocessor architecture which uses a novel algorithm-based fault-detection approach for identifying faulty processors. The scheme involves the detection and location of faulty processors concurrently with the actual execution of parallel applications on the hypercube. The authors have implemented system-level fault-detection mechanisms for various parallel applications on a 16-processor Intel iPSC hypercube multiprocessor. They report on the results of two applications: matrix multiplication and fast Fourier transform. They have performed extensive studies of fault coverage of their system-level fault-detection schemes in the presence of finite-precision arithmetic, which affects the system-level encodings. They propose a reconfiguration strategy for reconfiguring the system around faulty processors by introducing spare links and nodes.
  • Keywords
    fast Fourier transforms; fault tolerant computing; matrix algebra; multiprocessor interconnection networks; parallel architectures; 16 bit; 16-processor Intel iPSC hypercube multiprocessor; Intel hypercube multiprocessor; algorithm-based fault-detection; fast Fourier transform; fault coverage; fault-tolerant hypercube multiprocessor architecture; finite-precision arithmetic; matrix multiplication; reconfiguration strategy; spare links; system-level encodings; system-level fault tolerance; system-level fault-detection mechanisms; system-level fault-detection schemes; Algorithm design and analysis; Computer architecture; Costs; Fault detection; Fault diagnosis; Fault tolerance; Fault tolerant systems; Hypercubes; Logic; Testing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Fault-Tolerant Computing, 1988. FTCS-18, Digest of Papers., Eighteenth International Symposium on
  • Conference_Location
    Tokyo, Japan
  • Print_ISBN
    0-8186-0867-6
  • Type
    conf
  • DOI
    10.1109/FTCS.1988.5344
  • Filename
    5344