• DocumentCode
    1190937
  • Title

    Tests and tolerances for high-performance software-implemented fault detection

  • Author

    Turmon, Michael ; Granat, Robert ; Katz, Daniel S. ; Lou, John Z.

  • Author_Institution
    Jet Propulsion Lab., California Inst. of Technol., Pasadena, CA, USA
  • Volume
    52
  • Issue
    5
  • fYear
    2003
  • fDate
    5/1/2003 12:00:00 AM
  • Firstpage
    579
  • Lastpage
    591
  • Abstract
    We describe and test a software approach to fault detection in common numerical algorithms. Such result checking or algorithm-based fault tolerance (ABFT) methods may be used, for example, to overcome single-event upsets in computational hardware or to detect errors in complex, high-efficiency implementations of the algorithms. Following earlier work, we use checksum methods to validate results returned by a numerical subroutine operating subject to unpredictable errors in data. We consider common matrix and Fourier algorithms which return results satisfying a necessary condition having a linear form; the checksum tests compliance with this condition. We discuss the theory and practice of setting numerical tolerances to separate errors caused by a fault from those inherent in finite-precision floating-point calculations. We concentrate on comprehensively defining and evaluating tests having various accuracy/computational burden tradeoffs, and we emphasize average-case algorithm behavior rather than using worst-case upper bounds on error.
  • Keywords
    error analysis; fault tolerant computing; parallel algorithms; roundoff errors; software fault tolerance; Fourier algorithms; algorithm-based fault tolerance methods; checksum methods; checksum tests compliance; common numerical algorithms; high-performance software-implemented fault detection; worst case upper bounds; Algorithms; Application software; Delay; Fault detection; Fault tolerance; Hardware; Single event transient; Single event upset; Software testing; Space technology;
  • fLanguage
    English
  • Journal_Title
    Computers, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0018-9340
  • Type

    jour

  • DOI
    10.1109/TC.2003.1197125
  • Filename
    1197125