DocumentCode :
2978905
Title :
Design Trade-Offs and Deadlock Prevention in Transient Fault-Tolerant SMT Processors
Author :
Li, Xiaobin ; Gaudiot, Jean-Luc
Author_Institution :
Enterprise Microprocessor Group, Intel Corp.
fYear :
2006
fDate :
Dec. 2006
Firstpage :
315
Lastpage :
322
Abstract :
Since the very concept of simultaneous multi-threading (SMT) entails inherent redundancy, some proposals have been made to run two copies of the same thread on SMT platforms in order to detect and correct soft errors. Upon detection of an error, the processor state is rolled back to a known safe point and the instructions are retried, resulting in error-free execution. This paper focuses on two crucial implementation issues introduced by this concept: (i) the design trade-off between fault detection coverage and design cost; (ii) the possible occurrence of deadlock situations. To achieve the largest possible fault detection coverage, we replicate the fetched instructions to generate the redundant thread copies. Further, we apply SMT thread scheduling at the instruction dispatch stage so as to lower the performance overhead. Our simulation results show that, relative to the baseline processor, these two new schemes reduce the performance overhead to as little as 34% on average, down from 42%. Finally, in the fault-tolerant execution mode, since the two thread copies cooperate with one another, deadlock situations could be quite common. We thus present a detailed deadlock analysis and conclude that allocating some entries of the ROB, LQ, and SQ for the trailing thread is sufficient to prevent such deadlocks.
Keywords :
fault tolerant computing; multi-threading; processor scheduling; resource allocation; system recovery; SMT thread scheduling; deadlock prevention; resource allocation; simultaneous multi-threading processor; trade-off design; transient fault-tolerant; Costs; Error correction; Fault detection; Fault tolerance; Processor scheduling; Proposals; Redundancy; Surface-mount technology; System recovery; Yarn;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Dependable Computing, 2006. PRDC '06. 12th Pacific Rim International Symposium on
Conference_Location :
Riverside, CA
Print_ISBN :
0-7695-2724-8
Type :
conf
DOI :
10.1109/PRDC.2006.25
Filename :
4041917