Regional Information Center for Science and Technology - Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand

DocumentCode :

2503911

Title :

Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand

Author :

Gao, Qi ; Yu, Weikuan ; Huang, Wei ; Panda, Dhabaleswar K.

Author_Institution :

Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH

fYear :

2006

fDate :

14-18 Aug. 2006

Firstpage :

471

Lastpage :

478

Abstract :

Ultra-scale computer clusters with high-speed interconnects, such as InfiniBand, are being widely deployed for their excellent performance and cost effectiveness. However, the failure rate on these clusters also increases with their growing number of components. Thus, it becomes critical for such systems to be equipped with fault tolerance support. In this paper, we present our design and implementation of a checkpoint/restart framework for MPI programs running over InfiniBand clusters. Our design enables low-overhead, application-transparent checkpointing. It uses a coordinated protocol to save the current state of the whole MPI job to reliable storage, which allows users to perform rollback recovery if the system later runs into a faulty state. Our solution has been incorporated into MVAPICH2, an open-source, high-performance MPI-2 implementation over InfiniBand. Performance evaluation of this implementation has been carried out using the NAS benchmarks, the HPL benchmark, and a real-world application called GROMACS. Experimental results indicate that, in our design, the overhead of taking checkpoints is low, and the performance impact of checkpointing applications periodically is insignificant. For example, the time for checkpointing GROMACS is less than 0.3% of the execution time, and its performance decreases by only 4% with checkpoints taken every minute. To the best of our knowledge, this work is the first report of checkpoint/restart support for MPI over InfiniBand clusters in the literature.

Keywords :

checkpointing; fault tolerant computing; message passing; protocols; workstation clusters; InfiniBand clusters; MPI programs; application-transparent checkpoint; application-transparent restart; fault tolerance; open-source high performance MPI-2; rollback recovery; Checkpointing; Computer networks; Computer science; Costs; Fault tolerance; Fault tolerant systems; High performance computing; Laboratories; Protocols; Sun;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Parallel Processing, 2006. ICPP 2006. International Conference on

Conference_Location :

Columbus, OH

ISSN :

0190-3918

Print_ISBN :

0-7695-2636-5

Type :

conf

DOI :

10.1109/ICPP.2006.26

Filename :

1690651

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2503911