• DocumentCode
    1518509
  • Title
    The broadcast comparison model for on-line fault diagnosis in multicomputer systems: theory and implementation
  • Author
    Blough, Douglas M. ; Brown, Hongying W.
  • Author_Institution
    Dept. of Electr. & Comput. Eng., California Univ., Irvine, CA, USA
  • Volume
    48
  • Issue
    5
  • fYear
    1999
  • fDate
    May 1999
  • Firstpage
    470
  • Lastpage
    493
  • Abstract
    This paper describes a new comparison-based model for distributed fault diagnosis in multicomputer systems with a weak reliable broadcast capability. The classical problems of diagnosability and diagnosis are both considered under this broadcast comparison model. A characterization of diagnosable systems is given, which leads to a polynomial-time diagnosability algorithm. A polynomial-time diagnosis algorithm for t-diagnosable systems is also given. A variation of this algorithm, which allows dynamic fault occurrence and incomplete diagnostic information, has been implemented in the COmmon Spaceborne Multicomputer Operating System (COSMOS). Results produced using a simulator for the JPL MAX multicomputer system running COSMOS show that the algorithm diagnoses all fault situations with low latency and very little overhead. These simulations demonstrate the practicality of the proposed diagnosis model and algorithm for multicomputer systems having weak reliable broadcast. This includes systems with fault-tolerant hardware for broadcast, as well as those where reliable broadcast is implemented in software.
  • Keywords
    distributed algorithms; fault diagnosis; fault tolerant computing; multiprocessor interconnection networks; COSMOS; COmmon Spaceborne Multicomputer Operating System; JPL MAX multicomputer system; broadcast comparison model; diagnosability; distributed fault diagnosis; dynamic fault occurrence; fault-tolerant hardware; multicomputer systems; online fault diagnosis; polynomial-time diagnosability algorithm; simulator; t-diagnosable systems; Broadcasting; Fault diagnosis; Heuristic algorithms; Military computing; Operating systems; Performance evaluation; Physics computing; Polynomials; Power system modeling; Testing
  • fLanguage
    English
  • Journal_Title
    Computers, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0018-9340
  • Type
    jour
  • DOI
    10.1109/12.769431
  • Filename
    769431