DocumentCode
285513
Title
Algorithm-based fault tolerance for floating-point operations in massively parallel systems
Author
Rexford, Jennifer ; Jha, Niraj K.
Author_Institution
Dept. of EECS, Michigan Univ., Ann Arbor, MI, USA
Volume
2
fYear
1992
fDate
10-13 May 1992
Firstpage
649
Abstract
Considers the applicability of algorithm-based fault tolerance (ABFT) to massively parallel scientific computation. Existing ABFT schemes can provide effective fault tolerance at a low cost for computation on matrices of moderate size; however, the methods do not scale well to floating-point operations on large systems. The authors propose the use of a partitioned linear encoding scheme to provide scalability. Matrix algorithms employing this scheme are presented and compared to current ABFT schemes with respect to numerical stability and hardware/time overhead. The partitioned scheme is shown to provide scalable linear codes with improved numerical properties at the cost of only a small increase in hardware and time overhead. The partitioned approach prevents overflow in encoding and can preserve the reflectivity of codes, while guarding against roundoff error in encoding. The sharper bound on numerical encoding error allows the method to provide more complete fault coverage.
Keywords
digital arithmetic; fault tolerant computing; parallel processing; ABFT; algorithm-based fault tolerance; floating-point operations; hardware/time overhead; massively parallel systems; numerical stability; overflow; partitioned linear encoding scheme; reflectivity; roundoff error; scalability; Computational efficiency; Concurrent computing; Encoding; Fault tolerance; Fault tolerant systems; Hardware; Linear code; Numerical stability; Partitioning algorithms; Scalability
fLanguage
English
Publisher
IEEE
Conference_Titel
Circuits and Systems, 1992. ISCAS '92. Proceedings., 1992 IEEE International Symposium on
Conference_Location
San Diego, CA
Print_ISBN
0-7803-0593-0
Type
conf
DOI
10.1109/ISCAS.1992.230168
Filename
230168