• DocumentCode
    1439591
  • Title

    Algorithm-based fault tolerance on a hypercube multiprocessor

  • Author

    Banerjee, Prithviraj ; Rahmeh, Joe T. ; Stunkel, Craig ; Nair, V.S. ; Roy, Kaushik ; Balasubramanian, Vijay ; Abraham, Jacob A.

  • Author_Institution
    Dept. of Electr. Eng., Illinois Univ., Urbana, IL, USA
  • Volume
    39
  • Issue
    9
  • fYear
    1990
  • fDate
    9/1/1990 12:00:00 AM
  • Firstpage
    1132
  • Lastpage
    1145
  • Abstract
    The design of a fault-tolerant hypercube multiprocessor architecture is discussed. The authors propose the detection and location of faulty processors concurrently with the actual execution of parallel applications on the hypercube using a novel scheme of algorithm-based error detection. System-level error detection mechanisms have been implemented for three parallel applications on a 16-processor Intel iPSC hypercube multiprocessor: matrix multiplication, Gaussian elimination, and fast Fourier transform. Schemes for other applications are under development. Extensive studies have been done of error coverage of the system-level error detection schemes in the presence of finite-precision arithmetic, which affects the system-level encodings. Two reconfiguration schemes are proposed that allow the authors to isolate and replace faulty processors with spare processors.
  • Keywords
    fault tolerant computing; multiprocessing systems; parallel architectures; Gaussian elimination; Intel iPSC hypercube; error detection; fast Fourier transform; fault tolerance; faulty processors; hypercube multiprocessor; matrix multiplication; multiprocessor architecture; Computer architecture; Computer errors; Costs; Fault detection; Fault diagnosis; Fault tolerance; Hypercubes; Jacobian matrices; Joining processes; Parallel architectures
  • fLanguage
    English
  • Journal_Title
    Computers, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0018-9340
  • Type

    jour

  • DOI
    10.1109/12.57055
  • Filename
    57055
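
The abstract names algorithm-based error detection for matrix multiplication and notes that finite-precision arithmetic affects the encodings, but it does not spell out the scheme. As a rough illustration only, the sketch below assumes the classic row/column checksum encoding for matrix multiplication on a single machine; the function names, the tolerance value, and the single-process setting are assumptions made here, not details taken from the paper.

```python
def column_checksum(a):
    """Return a copy of `a` with an extra row holding each column's sum."""
    return a + [[sum(col) for col in zip(*a)]]


def row_checksum(b):
    """Return a copy of `b` with an extra column holding each row's sum."""
    return [row + [sum(row)] for row in b]


def matmul(a, b):
    """Plain triple-loop style matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def find_errors(c_full, tol=1e-9):
    """Compare data sums against the checksum row and column.

    `c_full` is the product of a column-checksum matrix and a
    row-checksum matrix. Returns (bad_rows, bad_cols): indices whose
    recomputed sum differs from the stored checksum by more than a
    tolerance scaled to the largest entry. The tolerance stands in for
    the finite-precision effects mentioned in the abstract; its value
    here is an arbitrary assumption.
    """
    n = len(c_full) - 1          # number of data rows
    m = len(c_full[0]) - 1       # number of data columns
    scale = max(1.0, max(abs(v) for row in c_full for v in row))
    limit = tol * scale
    bad_rows = [i for i in range(n)
                if abs(sum(c_full[i][:m]) - c_full[i][m]) > limit]
    bad_cols = [j for j in range(m)
                if abs(sum(c_full[i][j] for i in range(n)) - c_full[n][j]) > limit]
    return bad_rows, bad_cols


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c_full = matmul(column_checksum(a), row_checksum(b))
clean = find_errors(c_full)

c_full[0][1] += 1.0              # inject a single erroneous element
faulty = find_errors(c_full)
```

Under this encoding a single corrupted element breaks exactly one row check and one column check, so their intersection locates it. In a multiprocessor setting the same checks would be distributed across nodes, which this sketch does not attempt to model.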