• DocumentCode
    3118256
  • Title
    HHC: Hierarchical hardware checkpointing to accelerate fault recovery for SRAM-based FPGAs
  • Author
    Enshan Yang ; Keheng Huang ; Yu Hu ; Xiaowei Li ; Jian Gong ; Hongjin Liu ; Bo Liu
  • Author_Institution
    State Key Lab. of Comput. Archit., Inst. of Comput. Technol., China
  • fYear
    2013
  • fDate
    8-10 July 2013
  • Firstpage
    193
  • Lastpage
    198
  • Abstract
    As the feature size shrinks to the nanometer scale, SRAM-based FPGAs are increasingly vulnerable to soft errors. Checkpointing is an effective fault recovery technique that can restore a faulty system to its previous fault-free state. Since the function of the system needs to be suspended during checkpoint saving and checkpoint restoring, the Mean Time to Repair (MTTR) of the system is critical to system performance. In this work, we propose a hierarchical hardware checkpointing (HHC) technique that combines a high-speed on-chip checkpoint and a low-speed off-chip checkpoint to accelerate fault recovery for SRAM-based FPGAs. Most single event effect (SEE) faults can be recovered by the high-speed on-chip checkpoint, which significantly reduces the MTTR of the system. The memory resource occupation of the on-chip checkpoint is low because HHC stores only the logic states of user bits and check information for configuration bits. Experimental results show that, compared with traditional off-chip checkpoint strategies, the proposed technique can reduce the MTTR of the system by 94.30%. In addition, the memory resource occupation is 11.11% of the FPGA, which is a little high but can be further optimized.
  • Keywords
    SRAM chips; checkpointing; field programmable gate arrays; HHC; MTTR; SRAM-based FPGA; checkpoint restoring; checkpoint saving; fault recovery technique; hierarchical hardware checkpointing technique; high-speed on-chip checkpoint; low-speed off-chip checkpoint; mean time to repair; single event effect faults; soft errors; Bandwidth; Checkpointing; Circuit faults; Error correction codes; Field programmable gate arrays; Hardware; System-on-chip; ECC; MTTR; SRAM-based FPGAs; fault recovery; hardware checkpoint; hierarchical;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    On-Line Testing Symposium (IOLTS), 2013 IEEE 19th International
  • Conference_Location
    Chania
  • Type
    conf
  • DOI
    10.1109/IOLTS.2013.6604078
  • Filename
    6604078