DocumentCode
1921241
Title
Mechanisms and Evaluation of Cross-Layer Fault-Tolerance for Supercomputing
Author
Ho, Chen-Han ; De Kruijf, Marc ; Sankaralingam, Karthikeyan ; Rountree, Barry ; Schulz, Martin ; De Supinski, Bronis R.
Author_Institution
Univ. of Wisconsin-Madison, Madison, WI, USA
fYear
2012
fDate
10-13 Sept. 2012
Firstpage
510
Lastpage
519
Abstract
Reliability is emerging as an important constraint for future microprocessors. Cooperative hardware and software approaches for error tolerance can solve this hardware reliability challenge. Cross-layer fault tolerance frameworks expose hardware failures to upper layers, like the compiler, to help correct faults. Such cooperative approaches require less hardware complexity than masking all faults at the hardware level and are generally more energy efficient. This paper provides a detailed design and an implementation study of cross-layer fault tolerance for supercomputing. Since supercomputers necessarily involve large component counts, they have more frequent failures than consumer electronics and small systems. Conventionally, these systems use redundancy and checkpointing to achieve reliable computing. However, redundancy increases acquisition as well as recurring energy costs. This paper describes a simple language-level mechanism coupled with complementary compilation and lightweight hardware error detection that provides efficient reliability and cross-layer fault tolerance for supercomputers. Our evaluation focuses on strong-scaling problems for which we can trade computing power for redundancy. Our results show a range of 1.07× to 2.5× speedup when employing cross-layer error tolerance compared to conventional full dual modular redundancy (DMR), which contains all errors within hardware. Further, we demonstrate the approach can sustain 7% to 50% lower energy. The most important result of this work is qualitative: we can use a simplified hardware design with relaxed architectural correctness guarantees.
Keywords
fault tolerant computing; parallel machines; checkpointing; cross-layer error tolerance; cross-layer fault tolerance; dual modular redundancy; hardware error detection; microprocessor reliability; reliable computing; supercomputing; Circuit faults; Fault tolerance; Fault tolerant systems; Hardware; Software; Supercomputers; Cross-Layer Fault Tolerance; HPC; Reliability; Supercomputing
fLanguage
English
Publisher
IEEE
Conference_Titel
Parallel Processing (ICPP), 2012 41st International Conference on
Conference_Location
Pittsburgh, PA
ISSN
0190-3918
Print_ISBN
978-1-4673-2508-0
Type
conf
DOI
10.1109/ICPP.2012.37
Filename
6337612