DocumentCode :
1783379
Title :
Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications
Author :
Di, Sheng ; Bouguerra, Mohamed Slim ; Bautista-Gomez, Leonardo ; Cappello, Franck
Author_Institution :
INRIA, Sophia-Antipolis, France
fYear :
2014
fDate :
19-23 May 2014
Firstpage :
1181
Lastpage :
1190
Abstract :
The HPC community projects that future extreme-scale systems will be much less stable than current petascale systems, thus requiring sophisticated fault tolerance to guarantee the completion of large-scale numerical computations. Execution failures may occur due to multiple factors at different scales, from transient uncorrectable memory errors localized in processes to massive system outages. Multi-level checkpoint/restart is a promising model that provides an elastic response to tolerate different types of failures. It stores checkpoints at different levels: e.g., local memory, remote memory, using a software RAID, local SSD, remote file system. In this paper, we respond to two open questions: 1) how to optimize the selection of checkpoint levels based on failure distributions observed in a system, and 2) how to compute the optimal checkpoint intervals for each of these levels. The contribution is three-fold. (1) We build a mathematical model to fit the multi-level checkpoint/restart mechanism with large-scale applications regarding various types of failures. (2) We theoretically optimize the entire execution performance for each parallel application by selecting the best checkpoint level combination and corresponding checkpoint intervals. (3) We characterize checkpoint overheads on different checkpoint levels in a real cluster environment, and evaluate our optimal solutions using both simulation with millions of cores and a real environment with real-world MPI programs running on hundreds of cores. Experiments show that optimized selections of levels associated with optimal checkpoint intervals at each level outperform other state-of-the-art solutions by 5-50 percent.
Keywords :
application program interfaces; checkpointing; fault tolerant computing; message passing; numerical analysis; parallel processing; pattern clustering; checkpoint levels; checkpoint overheads; elastic response; failure distributions; fault tolerance; large scale HPC applications; local memory; mathematical model; multilevel checkpoint model; multilevel restart; numerical computations; optimal checkpoint intervals; parallel application; petascale systems; real cluster environment; real-world MPI programs; remote file system; remote memory; software RAID; transient uncorrectable memory errors; Computational modeling; Equations; Hardware; Iterative methods; Mathematical model; Optimization; Transient analysis; Checkpoint/Restart model; Resilience; exascale High Performance Computing;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel and Distributed Processing Symposium, 2014 IEEE 28th International
Conference_Location :
Phoenix, AZ
ISSN :
1530-2075
Print_ISBN :
978-1-4799-3799-8
Type :
conf
DOI :
10.1109/IPDPS.2014.122
Filename :
6877346