DocumentCode :
2980643
Title :
Supporting User-directed Fault Tolerance over Standard MPI
Author :
Zhimin Wu ; Rui Wang ; Weizhi Xu ; Mingyu Chen ; Erlin Yao
Author_Institution :
State Key Lab. of Comput. Archit., Inst. of Comput. Technol., Beijing, China
fYear :
2012
fDate :
17-19 Dec. 2012
Firstpage :
696
Lastpage :
697
Abstract :
User-directed means that fault tolerance is carried out dynamically and that the fault tolerance mode is chosen by users based on application requirements. In this paper, we introduce a general scheme based on standard MPI that provides user-directed support for application-level algorithmic fault tolerance. User-directed fault tolerance serves as the connection between applications and algorithmic fault tolerance. As a case study, our scheme has been incorporated into HPL, combined with a non-blocking ABFT technique. We have tested the functional availability of our scheme for fault tolerance under real conditions. Our evaluation also shows that when no failure occurs, our support introduces only 2.5 percent overhead. When failures occur, the scalability of algorithmic fault tolerance is well maintained with our scheme.
Keywords :
application program interfaces; fault tolerant computing; message passing; HPL; application level algorithmic fault tolerance; functional availability; nonblocking ABFT technique; standard MPI; user-directed fault tolerance mode; Algorithm design and analysis; Conferences; Detectors; Fault tolerance; Fault tolerant systems; Scalability; Standards; HPL; algorithmic fault tolerance; application-level; standard MPI; user-directed fault tolerance;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel and Distributed Systems (ICPADS), 2012 IEEE 18th International Conference on
Conference_Location :
Singapore
ISSN :
1521-9097
Print_ISBN :
978-1-4673-4565-1
Electronic_ISBN :
1521-9097
Type :
conf
DOI :
10.1109/ICPADS.2012.100
Filename :
6413632