Title :
Fault-tolerance for exascale systems
Author :
Varela, Maria Ruiz ; Ferreira, Kurt B. ; Riesen, Rolf
Abstract :
Periodic, coordinated checkpointing to disk is the most prevalent fault-tolerance method used in modern large-scale, capability-class high-performance computing (HPC) systems. Previous work has shown that as a system grows in size, the inherent synchronization of coordinated checkpoint/restart (CR) limits application scalability; at large node counts the application spends most of its time checkpointing instead of doing useful work. Furthermore, a single component failure forces the application to restart from the last correct checkpoint. Suggested alternatives to coordinated CR include uncoordinated CR with message logging, redundant computation, and RAID-inspired, in-memory distributed checkpointing schemes. Each of these alternatives has different overheads that depend on both the scale and the communication characteristics of the application. In this work, using the Structural Simulation Toolkit (SST) simulator, we compare the performance characteristics of each of these resilience methods for a number of HPC application patterns on a number of proposed exascale machines. The results of this work provide valuable guidance on the most efficient resilience methods for exascale systems.
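The scaling argument in the abstract can be illustrated with a rough first-order model. The sketch below is not the SST-based simulation used in the paper. It assumes independent node failures (so system MTBF is node MTBF divided by node count), uses Young's approximation for the checkpoint interval, and estimates the fraction of time left for useful work. All function names and parameters are hypothetical, and the approximation is only meaningful while the checkpoint time is small relative to the system MTBF.

```python
import math


def system_mtbf(node_mtbf_hours: float, nodes: int) -> float:
    """System MTBF, assuming independent node failures (an assumption)."""
    return node_mtbf_hours / nodes


def young_interval(checkpoint_hours: float, mtbf_hours: float) -> float:
    """Young's first-order checkpoint interval: sqrt(2 * delta * M)."""
    return math.sqrt(2.0 * checkpoint_hours * mtbf_hours)


def useful_fraction(node_mtbf_hours: float, nodes: int,
                    checkpoint_hours: float, restart_hours: float) -> float:
    """Rough fraction of wall-clock time spent on useful work.

    Waste per unit time is approximated as:
      checkpoint cost per interval  +  (lost half interval + restart) per failure.
    The result is clamped to [0, 1] because the approximation breaks down
    once the checkpoint time approaches the system MTBF.
    """
    m = system_mtbf(node_mtbf_hours, nodes)
    tau = young_interval(checkpoint_hours, m)
    waste = checkpoint_hours / tau + (tau / 2.0 + restart_hours) / m
    return min(1.0, max(0.0, 1.0 - waste))
```

For example, calling `useful_fraction(43800, n, 1/12, 1/6)` (a 5-year node MTBF, 5-minute checkpoints, and 10-minute restarts, all illustrative values) for increasing `n` shows the useful fraction shrinking as the node count grows, which is the trend the abstract describes for coordinated CR.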
Keywords :
checkpointing; parallel programming; software fault tolerance; CR; HPC; SST; coordinated checkpoint; coordinated restart; exascale systems; fault tolerance method; high performance computing; message logging; redundant computation; structural simulation toolkit; Checkpointing; Computational modeling; Fault tolerance; Fault tolerant systems; Generators; Protocols; USA Councils;
Conference_Title :
Cluster Computing Workshops and Posters (CLUSTER WORKSHOPS), 2010 IEEE International Conference on
Conference_Location :
Heraklion, Crete
Print_ISBN :
978-1-4244-8395-2
Electronic_ISBN :
978-1-4244-8397-6
DOI :
10.1109/CLUSTERWKSP.2010.5613081