DocumentCode :
2011801
Title :
Self-repair of uncore components in robust system-on-chips: An OpenSPARC T2 case study
Author :
Li, Yanjing ; Cheng, Eddie ; Makar, Samy ; Mitra, Subhasish
Author_Institution :
Stanford Univ., Stanford, CA, USA
fYear :
2013
fDate :
6-13 Sept. 2013
Firstpage :
1
Lastpage :
10
Abstract :
Self-repair replaces/bypasses faulty components in a system-on-chip (SoC) to keep the system functioning correctly even in the presence of permanent faults. Such faults may result from early-life failures, circuit aging, and manufacturing defects and variations. Unlike on-chip memories, processor cores, and networks-on-chip, little attention has been paid to self-repair of uncore components (e.g., cache controllers, memory controllers, and I/O controllers) that occupy significant portions of multi-core SoCs. In this paper, we present new techniques that utilize architectural features to achieve self-repair of uncore components while incurring low area, power, and performance costs. We demonstrate the effectiveness and practicality of our techniques, using the industrial OpenSPARC T2 SoC with 8 processor cores that support 64 hardware threads. Our key results are:
1. Our techniques enable effective self-repair of any single faulty uncore component with 7.5% post-layout chip-level area impact and 3% power impact. In contrast, existing redundancy techniques impose high (e.g., 16%) area costs. Our techniques do not incur any performance impact in fault-free systems. In the presence of a single faulty uncore component, there can be a 5% application performance impact.
2. Our techniques are capable of self-repairing multiple faulty uncore components without any additional area impact, but with graceful degradation of application performance.
3. Our techniques achieve high self-repair coverage of 97.5% in the presence of a single fault. Our self-repair techniques also enable flexible tradeoffs between self-repair coverage and area costs. For example, 75% self-repair coverage can be achieved with 3.2% post-layout chip-level area impact.
Keywords :
failure analysis; fault diagnosis; integrated circuit layout; multiprocessing systems; system-on-chip; I-O controllers; OpenSPARC T2 case study; cache controllers; circuit aging; early-life failures; fault-free systems; hardware threads; industrial OpenSPARC T2 SoC; manufacturing defects; memory controllers; multicore SoC; multiple-faulty uncore components; network-on-chip; on-chip memories; permanent faults; post-layout chip-level area impact; power impact; processor cores; robust system-on-chips; single-faulty uncore component; uncore component self-repair; Circuit faults; Hardware; Maintenance engineering; Process control; Random access memory; Stress; System-on-chip;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Test Conference (ITC), 2013 IEEE International
Conference_Location :
Anaheim, CA
ISSN :
1089-3539
Type :
conf
DOI :
10.1109/TEST.2013.6651907
Filename :
6651907