DocumentCode :
1550686
Title :
The Reliability Wall for Exascale Supercomputing
Author :
Yang, Xuejun ; Wang, Zhiyuan ; Xue, Jingling ; Zhou, Yun
Author_Institution :
Nat. Lab. for Parallel & Distrib. Process., Nat. Univ. of Defense Technol., Changsha, China
Volume :
61
Issue :
6
fYear :
2012
fDate :
6/1/2012
Firstpage :
767
Lastpage :
779
Abstract :
Reliability is a key challenge to be understood to turn the vision of exascale supercomputing into reality. Inevitably, large-scale supercomputing systems, especially those at the peta/exascale levels, must tolerate failures by incorporating fault-tolerance mechanisms to improve their reliability and availability. As the benefits of fault-tolerance mechanisms rarely come without associated time and/or capital costs, reliability will limit the scalability of parallel applications. This paper introduces for the first time the concept of the "Reliability Wall" to highlight the significance of achieving scalable performance in peta/exascale supercomputing with fault tolerance. We quantify the effects of reliability on scalability by proposing a reliability speedup, defining quantitatively the reliability wall, giving an existence theorem for the reliability wall, and categorizing a given system according to the time overhead incurred by fault tolerance. We also generalize these results into a general reliability speedup/wall framework by considering not only speedup but also costup. We analyze and extrapolate the existence of the reliability wall using two representative supercomputers, Intrepid and ASCI White, both employing checkpointing for fault tolerance, and also study the general reliability wall using Intrepid. These case studies provide insights into how to mitigate reliability-wall effects in system design and through hardware/software optimizations in peta/exascale supercomputing.
Keywords :
checkpointing; fault tolerant computing; mainframes; parallel machines; ASCI White; availability; exascale supercomputing; fault-tolerance mechanisms; general reliability wall; hardware-software optimizations; large-scale supercomputing systems; parallel applications; petascale supercomputing; Fault tolerance; Fault tolerant systems; Reliability theory; Software reliability; Strontium; exascale; performance metric; reliability speedup; reliability wall
fLanguage :
English
Journal_Title :
Computers, IEEE Transactions on
Publisher :
IEEE
ISSN :
0018-9340
Type :
jour
DOI :
10.1109/TC.2011.106
Filename :
5871590