Regional Information Center for Science and Technology - Algorithm-based fault tolerance for floating-point operations in massively parallel systems

DocumentCode :

285513

Title :

Algorithm-based fault tolerance for floating-point operations in massively parallel systems

Author :

Rexford, Jennifer ; Jha, Niraj K.

Author_Institution :

Dept. of EECS, Michigan Univ., Ann Arbor, MI, USA

Volume :

fYear :

1992

fDate :

10-13 May 1992

Firstpage :

649

Abstract :

Considers the applicability of algorithm-based fault tolerance (ABFT) to massively parallel scientific computation. Existing ABFT schemes can provide effective fault tolerance at a low cost for computation on matrices of moderate size; however, the methods do not scale well to floating-point operations on large systems. The authors propose the use of a partitioned linear encoding scheme to provide scalability. Matrix algorithms employing this scheme are presented and compared to current ABFT schemes with respect to numerical stability and hardware/time overhead. The partitioned scheme is shown to provide scalable linear codes with improved numerical properties with only a small increase in hardware and time overhead. The partitioned approach prevents overflow in encoding and can preserve the reflectivity of codes, while guarding against roundoff error in encoding. The sharper bound on numerical encoding error allows the method to provide more complete fault coverage.
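
The abstract does not spell out the encoding itself, so the following is only a minimal sketch of the general idea, assuming a classic column-checksum ABFT matrix product in which the single checksum row is replaced by one checksum row per group of rows. The function names, the `block` parameter, and the tolerance `tol` are hypothetical and are not taken from the paper. Each checksum sums fewer elements than a single global checksum would, which is one plausible way to keep its magnitude and accumulated roundoff small.

```python
def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def column_sums(rows):
    """Sum each column of a list of rows."""
    return [sum(col) for col in zip(*rows)]


def encode_partitioned(a, block):
    """Append one checksum row per group of `block` consecutive rows of `a`."""
    groups = [a[i:i + block] for i in range(0, len(a), block)]
    return a + [column_sums(g) for g in groups]


def check_partitioned(c_full, n_rows, block, tol=1e-9):
    """Return the indices of row groups whose checksum no longer matches.

    `tol` is an assumed relative tolerance for floating-point roundoff.
    """
    data, checks = c_full[:n_rows], c_full[n_rows:]
    bad = []
    for g, start in enumerate(range(0, n_rows, block)):
        expected = column_sums(data[start:start + block])
        scale = max(1.0, max(abs(x) for x in expected))
        if any(abs(e - s) > tol * scale for e, s in zip(expected, checks[g])):
            bad.append(g)
    return bad


# Usage: encode A, multiply, then verify each row group of the product.
a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
b = [[0.5, -1.0], [2.0, 0.25]]
c_full = matmul(encode_partitioned(a, block=2), b)
clean = check_partitioned(c_full, n_rows=4, block=2)

c_full[2][1] += 1.0  # simulate a fault in data row 2, which lies in group 1
faulty = check_partitioned(c_full, n_rows=4, block=2)
```

Because the checksum rows are linear combinations of the rows of A, they carry through the multiplication unchanged, so a mismatch localizes an error to one row group rather than to the whole matrix.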

Keywords :

digital arithmetic; fault tolerant computing; parallel processing; ABFT; algorithm-based fault tolerance; floating-point operations; hardware/time overhead; massively parallel systems; numerical stability; overflow; partitioned linear encoding scheme; reflectivity; roundoff error; scalability; Computational efficiency; Concurrent computing; Encoding; Fault tolerance; Fault tolerant systems; Hardware; Linear code; Numerical stability; Partitioning algorithms; Scalability;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Circuits and Systems, 1992. ISCAS '92. Proceedings., 1992 IEEE International Symposium on

Conference_Location :

San Diego, CA

Print_ISBN :

0-7803-0593-0

Type :

conf

DOI :

10.1109/ISCAS.1992.230168

Filename :

230168

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=285513