DocumentCode :
2232047
Title :
Transient and Permanent Error Control for High-End Multiprocessor Systems-on-Chip
Author :
Yu, Qiaoyan ; Cano, José ; Flich, José ; Ampadu, Paul
Author_Institution :
Dept. of Electr. & Comput. Eng., Univ. of New Hampshire, Durham, NH, USA
fYear :
2012
fDate :
9-11 May 2012
Firstpage :
169
Lastpage :
176
Abstract :
High-end MPSoC systems with built-in high-radix topologies achieve good performance because of improved connectivity and reduced network diameter. In high-end MPSoC systems, fault tolerance support is becoming a compulsory feature. In this work, we propose a combined method to address permanent and transient link and router failures in those systems. The LBDRhr mechanism is proposed to tolerate permanent link failures in some popular high-radix topologies. The increased router complexity may lead to more transient router errors than occur in routers using a simple XY routing algorithm. We exploit the inherent information redundancy (IIR) in LBDRhr logic to manage transient errors in the network routers. Thorough analyses are provided to identify the appropriate internal nodes and the forbidden signal patterns for transient error detection. Simulation results show that LBDRhr logic can tolerate all of the permanent failure combinations of long-range links and 80% of link failures on short-range links. Case studies show that the error detection method based on the new IIR extraction method reduces the power consumption by 33% and the residual error rate by up to two orders of magnitude, compared to triple modular redundancy. The impact of network topology on the efficiency of the detection mechanism is also examined in this work.
Keywords :
circuit complexity; digital arithmetic; error detection; error statistics; fault tolerance; logic design; multiprocessing systems; network routing; network topology; network-on-chip; redundancy; IIR extraction method; LBDRhr logic; LBDRhr mechanism; XY routing algorithm; built-in high-radix topology; combined method; compulsory feature; connectivity; detection mechanism; error detection method; fault tolerance support; forbidden signal patterns; high-end MPSoC systems; high-end multiprocessor systems-on-chip; inherent information redundancy; internal nodes; long-range links; network diameter; network routers; network topology; permanent error control; permanent failure combinations; permanent link failures; popular high-radix topology; power consumption; residual error rate; router complexity; router failures; short-range links; thorough analyses; transient error control; transient error detection; transient errors; transient link; transient router errors; triple modular redundancy; Logic gates; Network topology; Redundancy; Routing; Topology; Transient analysis; Networks-on-chip; arbiter; fault tolerant; information redundancy; permanent error; reliability; transient error;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Networks on Chip (NoCS), 2012 Sixth IEEE/ACM International Symposium on
Conference_Location :
Copenhagen
Print_ISBN :
978-1-4673-0973-8
Type :
conf
DOI :
10.1109/NOCS.2012.27
Filename :
6209276