Title :
Transient-Error Detection and Recovery via Reverse Computation and Checkpointing
Author :
Tan, Lanfang ; Tan, Qingping ; Xu, Jianjun ; Li, Jianli
Author_Institution :
Comput. Sch., Nat. Univ. of Defense Technol., Changsha, China
Abstract :
The integration of error detection and recovery mechanisms becomes mandatory as the probability of transient errors increases. This study proposes a software-based fault-tolerant technique that achieves both detection and recovery. The technique is based on two main mechanisms: reverse computation and checkpointing. This study is the first to introduce reverse computation for error detection, by comparing the input data of the original computation with the output data of the reverse computation. Live variable analysis is used to reduce the overhead of checkpointing. A translation tool is implemented to make the original source code fault tolerant, with automatic error detection and recovery. Fault injection and performance overhead experiments are performed to evaluate the technique. Experimental results show that most errors can be recovered with relatively low performance overhead.
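The detection-and-recovery loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' translation tool: the names `forward`, `reverse`, `run_protected`, and `flip_bit_once` are hypothetical, the set of live variables is supplied by hand rather than derived by live variable analysis, and the fault hook only simulates a transient error on the first attempt.

```python
def forward(state):
    # Original computation, built only from steps that have an exact inverse.
    x, y = state["x"], state["y"]
    x = x + y
    y = y ^ x
    return {"x": x, "y": y}


def reverse(state):
    # Reverse computation: undo the steps of forward() in the opposite order.
    x, y = state["x"], state["y"]
    y = y ^ x
    x = x - y
    return {"x": x, "y": y}


def run_protected(state, live_vars, fault=None, max_retries=3):
    # Checkpoint only the variables assumed live at this point (given by hand here).
    checkpoint = {name: state[name] for name in live_vars}
    for attempt in range(max_retries + 1):
        out = forward(dict(checkpoint))      # always restart from the checkpoint
        if fault is not None:
            out = fault(out, attempt)        # hypothetical fault-injection hook
        # Detection: reversing the output must reproduce the original input.
        if reverse(out) == checkpoint:
            return out, attempt
        # Mismatch: discard the result and retry from the checkpoint.
    raise RuntimeError("error persisted after all retries")


def flip_bit_once(out, attempt):
    # Simulated transient error: flip one bit of x on the first attempt only.
    if attempt == 0:
        return {"x": out["x"] ^ 0b100, "y": out["y"]}
    return out


if __name__ == "__main__":
    state = {"x": 5, "y": 3, "tmp": 99}      # "tmp" is assumed dead, so it is not saved
    result, retries = run_protected(state, ["x", "y"], fault=flip_bit_once)
    print(result, retries)                   # {'x': 8, 'y': 11} 1
```

In this sketch the corrupted first attempt fails the comparison, execution restarts from the checkpoint, and the second attempt succeeds. Errors that strike the reverse computation itself, or that persist across retries, are outside what this toy example handles.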
Keywords :
checkpointing; software fault tolerance; software performance evaluation; fault injection; live variable analysis; performance overhead experiments; reverse computation; software-based fault tolerant technique; transient-error detection; transient-error recovery; translation tool; Fault tolerance; Fault tolerant systems; Instruments; Registers; Transient analysis; checksum; error recovery
Conference_Title :
Cluster Computing Workshops (CLUSTER WORKSHOPS), 2012 IEEE International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4673-2893-7
DOI :
10.1109/ClusterW.2012.33