• DocumentCode
    3178048
  • Title
    Fault-injection-based testing of fault-tolerant algorithms in message-passing parallel computers
  • Author
    Blough, D.M.; Torii, T.
  • Author_Institution
    Dept. of Electr. & Comput. Eng., California Univ., Irvine, CA, USA
  • fYear
    1997
  • fDate
    24-27 June 1997
  • Firstpage
    258
  • Lastpage
    267
  • Abstract
    Distributed-memory parallel computers offer inherent redundancy that can be exploited to provide software-implemented fault tolerance. Numerous algorithms have been developed for fault-tolerant unicast communication, fault-tolerant broadcast, fault diagnosis, checkpoint/rollback, various consensus problems, algorithm-based fault tolerance, etc. Correctness proofs for these algorithms tend to be quite complex and, as a result, are error-prone. Furthermore, the way in which an algorithm is implemented can have a dramatic impact on its correctness. Fault-injection-based testing is therefore an essential component of the validation procedure for these algorithms, and it can complement other methods such as formal verification. The authors present a methodology for fault injection in distributed-memory parallel computers that use a message-passing paradigm. Their approach is based on the injection of faults into interprocessor communications, and it allows emulation of the fault models commonly used in the design of fault-tolerant parallel algorithms. The methodology has been applied in a tool for fault injection in Intel iPSC/860 multicomputers and has been demonstrated through extensive testing of a fault-tolerant broadcast algorithm.
  • Keywords
    computer testing; distributed memory systems; fault tolerant computing; formal verification; message passing; parallel algorithms; parallel machines; redundancy; reliability; Intel iPSC/860 multicomputers; algorithm validation; correctness proofs; distributed-memory parallel computers; fault model emulation; fault-injection-based testing; fault-tolerant algorithms; fault-tolerant broadcast algorithm; fault-tolerant parallel algorithms; inherent redundancy; interprocessor communications; message-passing paradigm; message-passing parallel computers; software-implemented fault tolerance; Broadcasting; Concurrent computing; Distributed computing; Error correction; Fault diagnosis; Fault tolerance; Formal verification; Redundancy; Testing; Unicast
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Fault-Tolerant Computing, 1997. FTCS-27. Digest of Papers., Twenty-Seventh Annual International Symposium on
  • Conference_Location
    Seattle, WA, USA
  • ISSN
    0731-3071
  • Print_ISBN
    0-8186-7831-3
  • Type
    conf
  • DOI
    10.1109/FTCS.1997.614098
  • Filename
    614098