Regional Information Center for Science and Technology - Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications

DocumentCode :

1783379

Title :

Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications

Author :

Sheng Di ; Bouguerra, Mohamed Slim ; Bautista-Gomez, Leonardo ; Cappello, Franck

Author_Institution :

INRIA, Sophia-Antipolis, France

fYear :

2014

fDate :

19-23 May 2014

Firstpage :

1181

Lastpage :

1190

Abstract :

The HPC community projects that future extreme scale systems will be much less stable than current Petascale systems, thus requiring sophisticated fault tolerance to guarantee the completion of large scale numerical computations. Execution failures may occur due to multiple factors at different scales, from transient uncorrectable memory errors localized in processes to massive system outages. Multi-level checkpoint/restart is a promising model that provides an elastic response to tolerate different types of failures. It stores checkpoints at different levels: e.g., local memory, remote memory, using a software RAID, local SSD, remote file system. In this paper, we respond to two open questions: 1) how to optimize the selection of checkpoint levels based on failure distributions observed in a system, and 2) how to compute the optimal checkpoint intervals for each of these levels. The contribution is three-fold. (1) We build a mathematical model to fit the multi-level checkpoint/restart mechanism with large scale applications regarding various types of failures. (2) We theoretically optimize the entire execution performance for each parallel application by selecting the best checkpoint level combination and corresponding checkpoint intervals. (3) We characterize checkpoint overheads on different checkpoint levels in a real cluster environment, and evaluate our optimal solutions using both simulation with millions of cores and a real environment with real-world MPI programs running on hundreds of cores. Experiments show that optimized selections of levels associated with optimal checkpoint intervals at each level outperform other state-of-the-art solutions by 5-50 percent.
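
As a rough illustration of what "optimal checkpoint interval per level" means, the sketch below applies Young's classic first-order approximation, interval = sqrt(2 * C * M), to each level separately. This is not the model from the paper, which optimizes the level selection and the intervals jointly; it is a well-known single-level formula used here only to show the trade-off between checkpoint cost C and mean time between failures M. The function name, level names, and all numeric values are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the compute time between checkpoints.

    checkpoint_cost_s: time to write one checkpoint at this level (seconds).
    mtbf_s: mean time between the failures this level recovers from (seconds).
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


# Hypothetical (checkpoint cost, MTBF) pairs in seconds; not measured values.
levels = {
    "local memory": (1.0, 3600.0),
    "local SSD": (10.0, 36000.0),
    "remote file system": (100.0, 360000.0),
}

# Each level is treated independently here, which is a simplifying assumption.
intervals = {
    name: young_interval(cost, mtbf) for name, (cost, mtbf) in levels.items()
}

for name, interval in intervals.items():
    print(f"{name}: checkpoint roughly every {interval:.0f} s")
```

Under these assumed numbers, cheaper levels that cover more frequent failures are checkpointed more often, while the expensive remote file system level is checkpointed rarely.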

Keywords :

application program interfaces; checkpointing; fault tolerant computing; message passing; numerical analysis; parallel processing; pattern clustering; checkpoint levels; checkpoint overheads; elastic response; failure distributions; fault tolerance; large scale HPC applications; local memory; mathematical model; multilevel checkpoint model; multilevel restart; numerical computations; optimal checkpoint intervals; parallel application; petascale systems; real cluster environment; real-world MPI programs; remote file system; remote memory; software RAID; transient uncorrectable memory errors; Computational modeling; Equations; Hardware; Iterative methods; Mathematical model; Optimization; Transient analysis; Checkpoint/Restart model; Resilience; exascale High Performance Computing;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Parallel and Distributed Processing Symposium, 2014 IEEE 28th International

Conference_Location :

Phoenix, AZ

ISSN :

1530-2075

Print_ISBN :

978-1-4799-3799-8

Type :

conf

DOI :

10.1109/IPDPS.2014.122

Filename :

6877346

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1783379