Title :
EnHTM: Exploiting Hardware Transaction Memory for Achieving Low-Cost Fault Tolerance
Author :
Jianli Li ; Qingping Tan ; Lanfang Tan
Author_Institution :
Sch. of Comput., Nat. Univ. of Defense Technol., Changsha, China
Abstract :
Fault tolerance has become an essential concern for processor designers because of rising transient fault rates, even for processors used in mainstream computing. Because the mainstream commodity market accepts only low-cost fault-tolerance solutions, traditional high-end solutions are unacceptable due to their high overheads. This paper presents EnHTM, a hybrid software/hardware low-cost fault-tolerance solution for serial programs running on commodity systems. EnHTM employs a lightweight symptom-based mechanism to detect faults, and recovers from them using a minimally modified Hardware Transactional Memory (HTM) that features lazy conflict detection and lazy data versioning. A compile-time analysis approach is also used to support larger transactions, so that transient faults detected after a long latency can still be recovered. Experimental results show that EnHTM recovers from 89.4% of catastrophic failures caused by transient faults, with an average performance overhead of 2.6% in error-free executions.
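The recovery idea summarized in the abstract (detect a fault by its symptom, then roll back using an HTM with lazy data versioning) can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not EnHTM's hardware design: `Memory`, `Transaction`, `Symptom`, `run_with_recovery`, and `demo_region` are hypothetical names, and the "symptom" is assumed to be an access to an out-of-range address. Writes are buffered in a private write set and reach memory only at commit; if a symptom appears first, the buffer is dropped and the region re-executes. Conflict detection is not modelled, since the target here is a single serial program.

```python
from typing import Callable, Dict


class Symptom(Exception):
    """Hypothetical fault symptom, e.g. an access outside the valid address range."""


class Memory:
    """Committed memory state, modelled as a dict mapping address -> value."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.cells: Dict[int, int] = {}

    def check(self, addr: int) -> None:
        # Assumed symptom-based detector: an invalid address signals a likely fault.
        if not 0 <= addr < self.size:
            raise Symptom(f"invalid address {addr}")


class Transaction:
    """Lazy data versioning: stores are buffered until commit."""

    def __init__(self, mem: Memory) -> None:
        self.mem = mem
        self.write_set: Dict[int, int] = {}

    def load(self, addr: int) -> int:
        self.mem.check(addr)
        # Buffered values shadow committed memory (read-your-own-writes).
        if addr in self.write_set:
            return self.write_set[addr]
        return self.mem.cells.get(addr, 0)

    def store(self, addr: int, value: int) -> None:
        self.mem.check(addr)
        self.write_set[addr] = value  # committed memory is not touched yet

    def commit(self) -> None:
        # Publish all buffered writes at once.
        self.mem.cells.update(self.write_set)


def run_with_recovery(mem: Memory,
                      region: Callable[[Transaction, int], None],
                      max_retries: int = 3) -> int:
    """Run `region` transactionally; on a symptom, discard its writes and retry.

    Returns the number of aborts before a successful commit.
    """
    for attempt in range(max_retries + 1):
        tx = Transaction(mem)
        try:
            region(tx, attempt)
        except Symptom:
            continue  # abort: the write set is simply dropped
        tx.commit()
        return attempt
    raise RuntimeError("fault persisted across all retries")


def demo_region(tx: Transaction, attempt: int) -> None:
    """Hypothetical region that increments cell 0 and then writes cell 1.

    A transient fault is injected on the first attempt only, by
    corrupting the address of the second store.
    """
    tx.store(0, tx.load(0) + 1)
    addr = 99 if attempt == 0 else 1
    tx.store(addr, 7)


if __name__ == "__main__":
    mem = Memory(size=16)
    aborts = run_with_recovery(mem, demo_region)
    print(aborts, mem.cells)
```

Running the demo aborts once and then commits, leaving cell 0 at 1 rather than 2: because the first attempt's increment stayed in the discarded write set, re-execution does not apply it twice. This is why lazy versioning makes rollback cheap, since nothing in committed memory has to be undone.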
Keywords :
fault diagnosis; fault tolerant computing; program compilers; system recovery; transaction processing; EnHTM; catastrophic failure; commodity system; compile-time analysis approach; error-free execution; fault detection; fault recovery; hardware transaction memory; hybrid software-hardware implemented low-cost fault tolerance solution; lazy conflict detection; lazy data versioning; light-weight symptom-based mechanism; mainstream commodity market; minimally-modified hardware transactional memory; performance overhead; processor design; serial program; transaction size; transient fault rate; Automation; Manufacturing; Compile-time analysis; HTM; Symptom-based mechanism; Transient faults;
Conference_Title :
Digital Manufacturing and Automation (ICDMA), 2013 Fourth International Conference on
Conference_Location :
Qingdao
DOI :
10.1109/ICDMA.2013.130