DocumentCode :
285513
Title :
Algorithm-based fault tolerance for floating-point operations in massively parallel systems
Author :
Rexford, Jennifer ; Jha, Niraj K.
Author_Institution :
Dept. of EECS, Michigan Univ., Ann Arbor, MI, USA
Volume :
2
fYear :
1992
fDate :
10-13 May 1992
Firstpage :
649
Abstract :
Considers the applicability of algorithm-based fault tolerance (ABFT) to massively parallel scientific computation. Existing ABFT schemes can provide effective fault tolerance at a low cost for computation on matrices of moderate size; however, the methods do not scale well to floating-point operations on large systems. The authors propose the use of a partitioned linear encoding scheme to provide scalability. Matrix algorithms employing this scheme are presented and compared to current ABFT schemes with respect to numerical stability and hardware/time overhead. The partitioned scheme is shown to provide scalable linear codes with improved numerical properties with only a small increase in hardware and time overhead. The partitioned approach prevents overflow in encoding and can preserve the reflectivity of codes, while guarding against roundoff error in encoding. The sharper bound on numerical encoding error allows the method to provide more complete fault coverage.
Keywords :
digital arithmetic; fault tolerant computing; parallel processing; ABFT; algorithm-based fault tolerance; floating-point operations; hardware/time overhead; massively parallel systems; numerical stability; overflow; partitioned linear encoding scheme; reflectivity; roundoff error; scalability; Computational efficiency; Concurrent computing; Encoding; Fault tolerance; Fault tolerant systems; Hardware; Linear code; Numerical stability; Partitioning algorithms; Scalability;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Circuits and Systems, 1992. ISCAS '92. Proceedings., 1992 IEEE International Symposium on
Conference_Location :
San Diego, CA
Print_ISBN :
0-7803-0593-0
Type :
conf
DOI :
10.1109/ISCAS.1992.230168
Filename :
230168