• DocumentCode
    3025774
  • Title
    EnHTM: Exploiting Hardware Transaction Memory for Achieving Low-Cost Fault Tolerance
  • Author
    Jianli Li ; Qingping Tan ; Lanfang Tan
  • Author_Institution
    Sch. of Comput., Nat. Univ. of Defense Technol., Changsha, China
  • fYear
    2013
  • fDate
    29-30 June 2013
  • Firstpage
    550
  • Lastpage
    554
  • Abstract
    Fault tolerance has become an essential concern for processor designers due to increasing transient fault rates, even for processors used in mainstream computing. As the mainstream commodity market accepts only low-cost fault tolerance solutions, traditional high-end solutions are unacceptable due to their expensive overheads. This paper presents EnHTM, a hybrid software/hardware low-cost fault tolerance solution for serial programs running on commodity systems. EnHTM employs a light-weight symptom-based mechanism to detect faults and recovers from them using a minimally modified Hardware Transactional Memory (HTM) that features lazy conflict detection and lazy data versioning. Compile-time analysis is also exploited to support larger transaction sizes, so that transient faults with long detection latency can still be recovered from. The evaluation shows that EnHTM can recover from 89.4% of catastrophic failures caused by transient faults, with an average performance overhead of 2.6% in error-free executions.
  • Keywords
    fault diagnosis; fault tolerant computing; program compilers; system recovery; transaction processing; EnHTM; catastrophic failure; commodity system; compile-time analysis approach; error-free execution; fault detection; fault recovery; hardware transaction memory; hybrid software-hardware implemented low-cost fault tolerance solution; lazy conflict detection; lazy data versioning; light-weight symptom-based mechanism; mainstream commodity market; minimally-modified hardware transactional memory; performance overhead; processor design; serial program; transaction size; transient fault rate; Automation; Manufacturing; Compile-time analysis; HTM; Symptom-based mechanism; Transient faults;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Digital Manufacturing and Automation (ICDMA), 2013 Fourth International Conference on
  • Conference_Location
    Qingdao
  • Type
    conf
  • DOI
    10.1109/ICDMA.2013.130
  • Filename
    6598051