Title :
GiFT: Automating FTPA Implementation for MPI Programs
Author :
Fu, Hongyi ; Du, Yunfei ; Wang, Panfeng ; Jia, Jia ; Yang, Xuejun
Author_Institution :
Sch. of Comput., Nat. Univ. of Defense Technol., Changsha
Abstract :
Fault tolerance is a critical issue in large-scale computing. The fault-tolerant parallel algorithm (FTPA) is an application-level technique for tolerating hardware failures. FTPA achieves fast failure recovery by means of parallel recomputing. However, it complicates the coding of the application program. This paper uses compiler technology to automate the design of FTPA, and introduces the implementation of a tool called GiFT (Get it Fault-Tolerant). GiFT uses extended data-flow analysis to select the state needed for failure recovery, applies a parallel recomputing time model to compute the optimal number of recomputing processes, and uses parallelization techniques to generate the parallel recomputing code. The experimental results show that GiFT correctly transforms original MPI programs into their FTPA counterparts, and that the performance of GiFT-generated FTPA programs is comparable to that of hand-modified FTPA programs.
Keywords :
application program interfaces; checkpointing; fault tolerant computing; message passing; parallel algorithms; Get it Fault-Tolerant; GiFT; MPI programs; data-flow analysis; failure recovery; fault tolerance; fault-tolerant parallel algorithm; hardware failures tolerance; parallel recomputing; Application software; Concurrent computing; Data analysis; Distributed computing; Fault tolerance; Hardware; High performance computing; Large-scale systems; Parallel algorithms; Scientific computing
Conference_Title :
Parallel and Distributed Systems, 2008. ICPADS '08. 14th IEEE International Conference on
Conference_Location :
Melbourne, VIC
Print_ISBN :
978-0-7695-3434-3
DOI :
10.1109/ICPADS.2008.89