Title :
Performance evaluation of checksum-based ABFT
Author :
Al-Yamani, Ahmad A. ; Oh, Nahmsuk ; McCluskey, Edward J.
Author_Institution :
Center for Reliable Comput., Stanford Univ., CA, USA
Abstract :
In algorithm-based fault tolerance (ABFT), fault tolerance is tailored to the algorithm performed. Most previous studies that compared ABFT schemes considered only error detection and correction capabilities. Some previous studies looked at the overhead, but no previous work compared different recovery schemes for data processing applications with throughput as the main metric. We compare the performance of two recovery schemes, recomputing and ABFT correction, for different error rates. We consider errors that occur during computation as well as those that occur during the error detection, location, and correction processes. A metric for performance evaluation of different design alternatives is defined. Results show that multiple error correction using ABFT has poorer performance than single error correction even at high error rates. We also present, implement, and evaluate early detection in ABFT. In early detection, we try to detect the errors that occur in the checksum calculation before starting the actual computation. Early detection improves throughput for intensive computations and at high error rates.
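The two recovery schemes named in the abstract can be illustrated with a minimal sketch. This record does not include the paper's implementation, so the code below assumes the classic row/column checksum scheme for matrix multiplication as the example application. All function names are hypothetical, and integer matrices are used so that checksum comparisons are exact. A single erroneous data element is corrected from its row checksum; any other mismatch pattern falls back to recomputing.

```python
def column_checksum(a):
    """Return a copy of a with one extra row holding each column's sum."""
    return a + [[sum(col) for col in zip(*a)]]


def row_checksum(b):
    """Return a copy of b with one extra column holding each row's sum."""
    return [row + [sum(row)] for row in b]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def locate_errors(c):
    """Return (bad_rows, bad_cols): data rows/columns whose sum disagrees
    with the stored checksum in the full checksum matrix c."""
    n, m = len(c) - 1, len(c[0]) - 1
    bad_rows = [i for i in range(n) if sum(c[i][:m]) != c[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c[i][j] for i in range(n)) != c[n][j]]
    return bad_rows, bad_cols


def recover(a, b, c):
    """Return (result, action) for a possibly corrupted checksum product c.

    Assumed policy for this sketch: correct exactly one bad data element,
    otherwise recompute the product from the original operands.
    """
    n, m = len(c) - 1, len(c[0]) - 1
    bad_rows, bad_cols = locate_errors(c)
    if not bad_rows and not bad_cols:
        return [row[:m] for row in c[:n]], "ok"
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        fixed = [row[:m] for row in c[:n]]
        # Row checksum minus the other elements of the row gives the true value.
        fixed[i][j] = c[i][m] - (sum(c[i][:m]) - c[i][j])
        return fixed, "corrected"
    return matmul(a, b), "recomputed"


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = matmul(column_checksum(a), row_checksum(b))  # full checksum product
c[0][1] = 99                                     # inject a single error
result, action = recover(a, b, c)
```

In this sketch a mismatch in only a row or only a column (for example, a corrupted checksum element) is handled conservatively by recomputing. The relative throughput of the two paths at a given error rate is the question the paper evaluates; the sketch makes no claim about it.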
Keywords :
fault tolerant computing; performance evaluation; system recovery; algorithm-based fault tolerance; checksum calculation; checksum-based ABFT; error correction processes; error rates; intensive computations; recomputing; recovery schemes; Data processing; Error analysis; Error correction; Fault detection; Fault tolerance; Fault tolerant systems; Matrix decomposition; Space technology; Tail; Throughput
Conference_Title :
Defect and Fault Tolerance in VLSI Systems, 2001. Proceedings. 2001 IEEE International Symposium on
Conference_Location :
San Francisco, CA
Print_ISBN :
0-7695-1203-8
DOI :
10.1109/DFTVS.2001.966800