• DocumentCode
    1153527
  • Title

    Bounds on Algorithm-Based Fault Tolerance in Multiple Processor Systems

  • Author

    Banerjee, Prithviraj ; Abraham, Jacob A.

  • Author_Institution
    Department of Electrical and Computer Engineering and the Coordinated Science Laboratory, University of Illinois
  • Issue
    4
  • fYear
    1986
  • fDate
    4/1/1986 12:00:00 AM
  • Firstpage
    296
  • Lastpage
    306
  • Abstract
    An important consideration in the design of high-performance multiple processor systems is ensuring the correctness of results computed by such complex systems, which are extremely prone to transient and intermittent failures. The detection and location of faults and errors concurrently with normal system operation can be achieved through the application of appropriate on-line checks on the results of the computations. This is the domain of algorithm-based fault tolerance, which deals with low-cost system-level fault-tolerance techniques to produce reliable computations in multiple processor systems by tailoring the fault-tolerance techniques toward specific algorithms. This paper presents a graph-theoretic model for determining upper and lower bounds on the number of checks needed for achieving concurrent fault detection and location. The objective is to estimate the overhead in time and the number of processors required for such a scheme. Faults in processors, errors in the data, and checks on the data to detect and locate errors are represented as a tripartite graph. Bounds on the time and processor overhead are obtained by considering a series of subproblems. First, using some crude concepts for t-fault detection and t-fault location, bounds on the maximum size of the error patterns that can arise from such fault patterns are obtained. Using these results, bounds are derived on the number of checks required for error detection and location. Some numerical results are derived from a linear programming formulation.
  • Keywords
    Checks; errors; fault detection; fault location; graph model; linear programming; lower bounds; system-level faults; upper bounds; Computer errors; Concurrent computing; Fast Fourier transforms; Fault detection; Fault location; Fault tolerant systems; Hardware; Jacobian matrices; Linear programming; Upper bound
  • fLanguage
    English
  • Journal_Title
    Computers, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0018-9340
  • Type

    jour

  • DOI
    10.1109/TC.1986.1676762
  • Filename
    1676762