Regional Information Center for Science and Technology (RICEST)

DocumentCode :

2720998

Title :

Fault-tolerance for exascale systems

Author :

Varela, Maria Ruiz ; Ferreira, Kurt B. ; Riesen, Rolf

fYear :

2010

fDate :

20-24 Sept. 2010

Firstpage :

Lastpage :

Abstract :

Periodic, coordinated checkpointing to disk is the most prevalent fault tolerance method used in modern large-scale, capability-class, high-performance computing (HPC) systems. Previous work has shown that as the system grows in size, the inherent synchronization of coordinated checkpoint/restart (CR) limits application scalability; at large node counts the application spends most of its time checkpointing instead of executing useful work. Furthermore, a single component failure forces an application restart from the last correct checkpoint. Suggested alternatives to coordinated CR include uncoordinated CR with message logging, redundant computation, and RAID-inspired, in-memory distributed checkpointing schemes. Each of these alternatives has differing overheads that depend on both the scale and the communication characteristics of the application. In this work, using the Structural Simulation Toolkit (SST) simulator, we compare the performance characteristics of each of these resilience methods for a number of HPC application patterns on a number of proposed exascale machines. The results of this work provide valuable guidance on the most efficient resilience methods for exascale systems.

Keywords :

checkpointing; parallel programming; software fault tolerance; CR; HPC; SST; coordinated checkpoint; coordinated restart; exascale systems; fault tolerance method; high performance computing; message logging; redundant computation; structural simulation toolkit; Checkpointing; Computational modeling; Fault tolerance; Fault tolerant systems; Generators; Protocols; USA Councils;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Cluster Computing Workshops and Posters (CLUSTER WORKSHOPS), 2010 IEEE International Conference on

Conference_Location :

Heraklion, Crete

Print_ISBN :

978-1-4244-8395-2

Electronic_ISBN :

978-1-4244-8397-6

Type :

conf

DOI :

10.1109/CLUSTERWKSP.2010.5613081

Filename :

5613081

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2720998