Title :
Decentralized run-time recovery mechanism for transient and permanent hardware faults for space-borne FPGA-based computing systems
Author :
Dumitriu, Victor ; Kirischian, Lev ; Kirischian, Valeri
Author_Institution :
Dept. of Electrical & Comput. Eng., Ryerson Univ., Toronto, ON, Canada
Abstract :
One of the most important problems for mission-critical space-borne computing systems employing FPGA devices is fault tolerance to transient and permanent hardware faults. In many cases, the ability to self-recover from such faults at run time is a vital feature. This paper presents a method and mechanism for run-time recovery of FPGA-based Systems-on-Chip (SoCs) based on Collaborative Macro-Function Units (CMFUs). Each CMFU consists of a macro-function-specific data-path, a control unit, and circuits providing self-integration, self-synchronization and self-recovery functions for the CMFU, without centralized control. The proposed mechanism allows run-time scrubbing or relocation of faulty SoC components, providing much higher flexibility and reliability of the system. This mechanism was implemented and tested on a Xilinx Kintex-7 FPGA platform. It was determined that the proposed approach can provide seamless run-time recovery for pipelined SoCs while remaining transparent to the application.
Keywords :
aerospace computing; fault tolerant computing; field programmable gate arrays; pipeline processing; system recovery; system-on-chip; CMFU; FPGA devices; Xilinx Kintex-7 FPGA platform; collaborative macro-function units; control unit; decentralized run-time recovery mechanism; fault tolerance; faulty component relocation; macro-function specific data-path; mission critical space-borne computing systems; permanent hardware faults; pipelined SoCs; run-time scrubbing; run-time self-recovery; self-integration functions; self-recovery functions; self-synchronization functions; space-borne FPGA-based computing systems; system flexibility; system reliability; transient hardware faults; Built-in self-test; Circuit faults; Field programmable gate arrays; Hardware; IP networks; System-on-chip; Transient analysis; fault-tolerance; reconfigurable computing; self-recovery; space-borne FPGA systems
Conference_Title :
Adaptive Hardware and Systems (AHS), 2014 NASA/ESA Conference on
Conference_Location :
Leicester
DOI :
10.1109/AHS.2014.6880157