DocumentCode :
2527906
Title :
Compiler-Assisted Application-Level Checkpointing for MPI Programs
Author :
Yang, Xuejun ; Wang, Panfeng ; Fu, Hongyi ; Du, Yunfei ; Wang, Zhiyuan ; Jia, Jia
Author_Institution :
Nat. Lab. for Parallel & Distrib. Process., Nat. Univ. of Defense Technol., Changsha
fYear :
2008
fDate :
17-20 June 2008
Firstpage :
251
Lastpage :
259
Abstract :
Application-level checkpointing can decrease the overhead of fault tolerance by minimizing the amount of checkpoint data. However, this technique requires the programmer to manually choose the critical data that should be saved. In this paper, we first propose a live-variable analysis method for MPI programs. Then, based on this analysis method, we provide a method for optimizing data saving in application-level checkpointing. Building on this theoretical foundation, we implement a source-to-source precompiler (ALEC) to automate application-level checkpointing. Finally, we evaluate the performance of five FORTRAN/MPI programs that ALEC transformed to integrate checkpointing features, on a 512-CPU cluster system. The experimental results show that i) application-level checkpointing based on live-variable analysis for MPI programs can efficiently reduce the amount of checkpoint data, thereby decreasing the overhead of checkpoint and restart; ii) ALEC is capable of automating application-level checkpointing correctly and effectively.
Keywords :
application program interfaces; checkpointing; fault tolerant computing; message passing; optimising compilers; program diagnostics; software performance evaluation; CPU cluster system; MPI program; compiler-assisted application-level checkpointing automation; data saving; fault tolerance overhead; live-variable analysis method; optimization method; program performance evaluation; source-to-source precompiler; Application software; Automatic logic units; Checkpointing; Concurrent computing; Distributed computing; High performance computing; Optimization methods; Program processors; Programming profession; Scientific computing;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Distributed Computing Systems, 2008. ICDCS '08. The 28th International Conference on
Conference_Location :
Beijing
ISSN :
1063-6927
Print_ISBN :
978-0-7695-3172-4
Electronic_ISBN :
1063-6927
Type :
conf
DOI :
10.1109/ICDCS.2008.25
Filename :
4595890