DocumentCode :
2143055
Title :
FaulTM: Error detection and recovery using Hardware Transactional Memory
Author :
Yalcin, Gulay ; Unsal, Osman ; Cristal, Adrian
Author_Institution :
Barcelona Supercomputing Center, Spain
fYear :
2013
fDate :
18-22 March 2013
Firstpage :
220
Lastpage :
225
Abstract :
Reliability is an essential concern for processor designers due to increasing transient and permanent fault rates. Executing instruction streams redundantly in chip multiprocessors (CMPs) provides high reliability since it can detect both transient and permanent faults. It also minimizes the Silent Data Corruption rate. However, comparing the results of the instruction streams, checkpointing the entire system, and recovering from detected errors might lead to substantial performance degradation. In this study we propose FaulTM, an error detection and recovery scheme that uses Hardware Transactional Memory (HTM) to reduce this performance degradation. We show how a minimally modified HTM that features lazy conflict detection and lazy data versioning can provide low-cost reliability in addition to HTM's intended purpose of supporting optimistic concurrency. Compared with lockstepping, FaulTM reduces the performance degradation by 2.5X on the SPEC2006 benchmarks.
Keywords :
Checkpointing; Degradation; Hardware; Registers; Reliability engineering; Transient analysis;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013
Conference_Location :
Grenoble, France
ISSN :
1530-1591
Print_ISBN :
978-1-4673-5071-6
Type :
conf
DOI :
10.7873/DATE.2013.058
Filename :
6513504