DocumentCode :
2189818
Title :
MMPI: A Scalable Fault Tolerance Mechanism for MPI Large Scale Parallel Computing
Author :
Wang, Zhiyuan ; Yang, Xuejun ; Zhou, Yun
Author_Institution :
Sch. of Comput., Nat. Univ. of Defense Technol., Changsha, China
fYear :
2010
fDate :
June 29 2010-July 1 2010
Firstpage :
1251
Lastpage :
1256
Abstract :
At present, Checkpoint/Restart is one of the most popular fault tolerance mechanisms for large scale parallel computing. However, when system performance lies between Peta (10^15) and Exa (10^18) flops, the time to save a global checkpoint reaches or even exceeds the mean time between failures (MTBF) of the components, which limits the scalability of parallel computing. In this paper, a scalable fault tolerance mechanism, MMPI, is designed for MPI-oriented large scale parallel computing. It can deal not only with the fail-stop faults addressed by Checkpoint/Restart, but also with most data errors that are not detected by hardware. Firstly, we define the concept of the redundant-process cluster (RPC), design running techniques that support MMPI, and study the implementation of MMPI. Secondly, we present the models of fault tolerance parallel speedup. Lastly, we verify the validity and scalability of the MMPI fault tolerance mechanism.
Keywords :
application program interfaces; checkpointing; fault tolerant computing; message passing; parallel processing; MMPI; MPI large scale parallel computing; checkpoint-restart technique; data error; fault tolerance parallel speedup; redundant process cluster; scalable fault tolerance mechanism; Computational modeling; Fault tolerance; Fault tolerant systems; Network topology; Parallel processing; Scalability; Topology; Fault tolerance mechanism; MPI large scale parallel computing; scalability;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer and Information Technology (CIT), 2010 IEEE 10th International Conference on
Conference_Location :
Bradford
Print_ISBN :
978-1-4244-7547-6
Type :
conf
DOI :
10.1109/CIT.2010.226
Filename :
5577877