DocumentCode :
2959013
Title :
A Case Study of Designing Efficient Algorithm-based Fault Tolerant Application for Exascale Parallelism
Author :
Yao, Erlin ; Wang, Rui ; Chen, Mingyu ; Tan, Guangming ; Sun, Ninghui
Author_Institution :
State Key Lab. of Comput. Archit., Inst. of Comput. Technol., Beijing, China
fYear :
2012
fDate :
21-25 May 2012
Firstpage :
438
Lastpage :
448
Abstract :
The fault tolerance overhead of high performance computing (HPC) applications is becoming critical to the efficient utilization of HPC systems at large scale. Today's HPC applications typically tolerate fail-stop failures by checkpointing. However, checkpointing loses its efficiency when systems become very large. An alternative method is algorithm-based fault recovery, which has been proved to be more efficient than checkpointing. In this paper, we first show by theoretical analysis that algorithm-based fault recovery will also lose its efficiency when systems scale up to exaflops. We then present a more efficient algorithm-based fault tolerance scheme for HPC applications at large scale. The new method has two novel techniques. The first is algorithm-based hot replacement, which avoids the stop-and-wait time after a failure. The second is background accelerated recovery, which allows the system to endure multiple failures in succession. As a case study, the method is incorporated into High Performance Linpack (HPL). Theoretical analysis shows that the fault tolerance overhead can be reduced to 2/log_2(p) of that of the algorithm-based fault recovery method (p is the number of computation processes), so that the new method will remain efficient at exascale. Experimental results for up to 1800 processes show that the overhead of the new method is about 25% of that of the algorithm-based fault recovery method, which is close to the theoretical prediction.
Keywords :
checkpointing; failure analysis; parallel processing; software fault tolerance; software libraries; Exaflops; HPC applications; HPL; algorithm-based fault recovery; algorithm-based fault tolerant application; algorithm-based hot replacement; checkpointing; exascale parallelism; fail-stop failures; high performance Linpack; high performance computing applications; stop-and-wait time; Acceleration; Algorithm design and analysis; Checkpointing; Fault tolerant systems; Prediction algorithms; Redundancy; Algorithm-Based Fault Tolerance; Exascale; High Performance Linpack;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International
Conference_Location :
Shanghai
ISSN :
1530-2075
Print_ISBN :
978-1-4673-0975-2
Type :
conf
DOI :
10.1109/IPDPS.2012.48
Filename :
6267880