Title :
POSTER: Energy-performance tradeoffs in multilevel checkpoint strategies
Author :
Bautista Gomez, Leonardo A. ; Balaprakash, Prasanna ; Bouguerra, Mohamed-Slim ; Wild, Stefan M. ; Cappello, Franck ; Hovland, Paul D.
Author_Institution :
Math. & Comput. Sci. Div., Argonne Nat. Lab., Argonne, IL, USA
Abstract :
Increased complexity of computer architectures, consideration of power constraints, and expected failure rates of hardware components make the design and analysis of energy-efficient fault-tolerance schemes an increasingly challenging and important task. We develop run-time and energy models and study FTI, a multilevel checkpoint library, on an IBM Blue Gene/Q. We show that FTI has a low energy footprint and that, consequently, optimal checkpoint-interval values with respect to time and energy are similar.
Keywords :
checkpointing; parallel machines; software fault tolerance; FTI; HPC; IBM Blue Gene/Q; energy-performance tradeoffs; fault-tolerance schemes; high-performance computing; multilevel checkpoint library; Checkpointing; Complexity theory; Encoding; Laboratories; Libraries; Power demand; Power measurement;
Conference_Title :
Cluster Computing (CLUSTER), 2014 IEEE International Conference on
Conference_Location :
Madrid
DOI :
10.1109/CLUSTER.2014.6968749