Regional Information Center for Science and Technology - The Reliability Wall for Exascale Supercomputing

DocumentCode :

1550686

Title :

The Reliability Wall for Exascale Supercomputing

Author :

Yang, Xuejun ; Wang, Zhiyuan ; Xue, Jingling ; Zhou, Yun

Author_Institution :

Nat. Lab. for Paralleling & Distrib. Process., Nat. Univ. of Defense Technol., Changsha, China

Volume :

Issue :

fYear :

2012

fDate :

6/1/2012

Firstpage :

767

Lastpage :

779

Abstract :

Reliability is a key challenge to be understood to turn the vision of exascale supercomputing into reality. Inevitably, large-scale supercomputing systems, especially those at the peta/exascale levels, must tolerate failures by incorporating fault-tolerance mechanisms to improve their reliability and availability. As the benefits of fault-tolerance mechanisms rarely come without associated time and/or capital costs, reliability will limit the scalability of parallel applications. This paper introduces for the first time the concept of the "Reliability Wall" to highlight the significance of achieving scalable performance in peta/exascale supercomputing with fault tolerance. We quantify the effects of reliability on scalability by proposing a reliability speedup, defining the reliability wall quantitatively, giving an existence theorem for the reliability wall, and categorizing a given system according to the time overhead incurred by fault tolerance. We also generalize these results into a general reliability speedup/wall framework by considering not only speedup but also costup. We analyze and extrapolate the existence of the reliability wall using two representative supercomputers, Intrepid and ASCI White, both employing checkpointing for fault tolerance, and also study the general reliability wall using Intrepid. These case studies provide insights on how to mitigate reliability-wall effects in system design and through hardware/software optimizations in peta/exascale supercomputing.

Keywords :

checkpointing; fault tolerant computing; mainframes; parallel machines; ASCI White; availability; exascale supercomputing; fault-tolerance mechanisms; general reliability wall; hardware-software optimizations; large-scale supercomputing systems; parallel applications; petascale supercomputing; Fault tolerance; Fault tolerant systems; Reliability theory; Software reliability; Strontium; exascale; performance metric; reliability speedup; reliability wall

fLanguage :

English

Journal_Title :

Computers, IEEE Transactions on

Publisher :

IEEE

ISSN :

0018-9340

Type :

jour

DOI :

10.1109/TC.2011.106

Filename :

5871590

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1550686