• DocumentCode
    285513
  • Title

    Algorithm-based fault tolerance for floating-point operations in massively parallel systems

  • Author

    Rexford, Jennifer ; Jha, Niraj K.

  • Author_Institution
    Dept. of EECS, Michigan Univ., Ann Arbor, MI, USA
  • Volume
    2
  • fYear
    1992
  • fDate
    10-13 May 1992
  • Firstpage
    649
  • Abstract
    Considers the applicability of algorithm-based fault tolerance (ABFT) to massively parallel scientific computation. Existing ABFT schemes can provide effective fault tolerance at a low cost for computation on matrices of moderate size; however, the methods do not scale well to floating-point operations on large systems. The authors propose the use of a partitioned linear encoding scheme to provide scalability. Matrix algorithms employing this scheme are presented and compared to current ABFT schemes with respect to numerical stability and hardware/time overhead. The partitioned scheme is shown to provide scalable linear codes with improved numerical properties and only a small increase in hardware and time overhead. The partitioned approach prevents overflow in encoding and can preserve the reflectivity of codes, while guarding against roundoff error in encoding. The sharper bound on numerical encoding error allows the method to provide more complete fault coverage.
  • Keywords
    digital arithmetic; fault tolerant computing; parallel processing; ABFT; algorithm-based fault tolerance; floating-point operations; hardware/time overhead; massively parallel systems; numerical stability; overflow; partitioned linear encoding scheme; reflectivity; roundoff error; scalability; Computational efficiency; Concurrent computing; Encoding; Fault tolerance; Fault tolerant systems; Hardware; Linear code; Numerical stability; Partitioning algorithms; Scalability
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Circuits and Systems, 1992. ISCAS '92. Proceedings., 1992 IEEE International Symposium on
  • Conference_Location
    San Diego, CA
  • Print_ISBN
    0-7803-0593-0
  • Type

    conf

  • DOI
    10.1109/ISCAS.1992.230168
  • Filename
    230168