Improving the Computing Efficiency of HPC Systems Using a Combination of Proactive and Preventive Checkpointing

Author

Bouguerra, Mohamed Slim ; Gainaru, Ana ; Gomez, Leonardo Bautista ; Cappello, Franck ; Matsuoka, Satoshi ; Maruyama, Naoya

fYear

2013

fDate

20-24 May 2013

Firstpage

501

Lastpage

512

Abstract

As failure frequency increases with the component count in modern and future supercomputers, resilience is becoming critical for extreme-scale systems. The association of failure prediction with proactive checkpointing seeks to reduce the effect of failures on the execution time of parallel applications. Unfortunately, proactive checkpointing does not systematically avoid restarting from scratch. To mitigate this issue, failure prediction and proactive checkpointing can be coupled with periodic checkpointing. However, blind use of these techniques does not always improve system efficiency, because each of them comes with a mix of overheads and benefits. In order to study and understand the combination of these techniques and their improvement of the system's efficiency, we developed: (i) a prototype combining state-of-the-art failure prediction, fast proactive checkpointing and preventive checkpointing; (ii) a mathematical model that reflects the expected computing efficiency of the combination and computes the optimal checkpointing interval in this context; (iii) a discrete event simulator to evaluate the computing efficiency of the combination for system parameters corresponding to current and projected large-scale HPC systems. We evaluate our proposed technique on a large supercomputer (TSUBAME2) with production-level HPC applications, and we show that failure prediction, proactive checkpointing and preventive checkpointing can be coupled successfully, imposing only about 2% to 6% of overhead in comparison with preventive checkpointing alone. Moreover, our model-based simulations show that the optimal solution improves the computing efficiency by up to 30% in comparison with classic periodic checkpointing. We show that the prediction recall has a much higher impact on execution efficiency than the prediction precision. This result suggests that researchers working on failure prediction algorithms should focus on improving the recall. We also show that the combination of these techniques can significantly improve (by a factor of 2, for a particular configuration) the mean time between failures (MTBF) perceived by the application.
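
The link between prediction recall, perceived MTBF and the checkpointing interval can be illustrated with a minimal sketch. This is not the model from the paper: it assumes Young's classic first-order approximation for the periodic checkpoint interval, and it assumes that every correctly predicted failure is fully absorbed by a proactive checkpoint, so only unpredicted failures count against the application. The function names and the example values (a 24-hour MTBF and a 10-minute checkpoint cost) are hypothetical.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order approximation of the periodic checkpoint interval.

    Both arguments are in the same time unit (seconds here).
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def perceived_mtbf(mtbf, recall):
    """MTBF seen by the application when a fraction `recall` of failures
    is predicted and (by assumption) handled by a proactive checkpoint."""
    if not 0.0 <= recall < 1.0:
        raise ValueError("recall must be in [0, 1)")
    return mtbf / (1.0 - recall)


if __name__ == "__main__":
    mtbf = 24 * 3600.0  # hypothetical system MTBF: 24 hours
    cost = 600.0        # hypothetical periodic checkpoint cost: 10 minutes
    for recall in (0.0, 0.5, 0.8):
        effective = perceived_mtbf(mtbf, recall)
        interval = young_interval(cost, effective)
        print(f"recall={recall:.1f}  perceived MTBF={effective / 3600:.1f} h  "
              f"interval={interval / 60:.1f} min")
```

Under these assumptions, a recall of 0.5 doubles the perceived MTBF and lengthens the checkpoint interval accordingly. Precision does not appear in the sketch at all; in a fuller model it would enter only through the cost of unnecessary proactive checkpoints.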

Keywords

checkpointing; fault tolerant computing; parallel machines; HPC systems computing efficiency; MTBF; discrete event simulator; extreme scale systems; failure frequency; failure prediction; large scale HPC systems; mean time between failures; model-based simulations; periodic checkpointing; preventive checkpointing; proactive checkpointing; production-level HPC applications; supercomputers; Checkpointing; Computational modeling; Correlation; Fault tolerance; Fault tolerant systems; Mathematical model; Predictive models; Failure prediction; large scale HPC systems; multilevel checkpointing; resilience;

fLanguage

English

Publisher

IEEE

Conference_Titel

Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on

Conference_Location

Boston, MA

ISSN

1530-2075

Print_ISBN

978-1-4673-6066-1

Type

conf

DOI

10.1109/IPDPS.2013.74

Filename

6569837

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=625613