Title :
A Case Study of Designing Efficient Algorithm-based Fault Tolerant Application for Exascale Parallelism
Author :
Yao, Erlin ; Wang, Rui ; Chen, Mingyu ; Tan, Guangming ; Sun, Ninghui
Author_Institution :
State Key Lab. of Comput. Archit., Inst. of Comput. Technol., Beijing, China
Abstract :
The fault tolerance overhead of high performance computing (HPC) applications is becoming critical to the efficient utilization of HPC systems at large scale. Today's HPC applications typically tolerate fail-stop failures by checkpointing. However, checkpointing loses its efficiency when systems become very large. An alternative is algorithm-based fault recovery, which has been shown to be more efficient than checkpointing. In this paper, we first show by theoretical analysis that algorithm-based fault recovery will also lose its efficiency when systems scale up to exaflops. We then present a more efficient algorithm-based fault tolerance scheme for HPC applications at large scale. The new method has two novel techniques. The first is algorithm-based hot replacement, which avoids the stop-and-wait time after a failure. The second is background accelerated recovery, which allows the system to endure multiple failures in succession. As a case study, the method is incorporated into High Performance Linpack (HPL). Theoretical analysis shows that the fault tolerance overhead can be reduced to 2/log_2(p) of that of the algorithm-based fault recovery method, where p is the number of computation processes, so the new method will remain efficient at exascale. Experimental results for up to 1800 processes show that the overhead of the new method is about 25% of that of the algorithm-based fault recovery method, which is close to the theoretical prediction.
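As an illustration of the overhead ratio quoted in the abstract, the following minimal sketch evaluates it for a few process counts. It assumes the abstract's logarithm is base 2, so the ratio is 2 / log2(p). The function name `overhead_ratio` and the process counts other than 1800 are hypothetical and chosen only for illustration; they do not come from the paper.

```python
import math


def overhead_ratio(p: int) -> float:
    """Predicted overhead of the new scheme relative to algorithm-based
    fault recovery, ASSUMING the ratio is 2 / log2(p), where p is the
    number of computation processes."""
    if p < 2:
        raise ValueError("p must be at least 2 so that log2(p) is positive")
    return 2.0 / math.log2(p)


if __name__ == "__main__":
    # 1800 is the largest process count mentioned in the abstract;
    # the larger values are illustrative assumptions only.
    for p in (1800, 10**6, 10**9):
        print(f"p = {p:>10}: predicted ratio = {overhead_ratio(p):.3f}")
```

Under this reading, the predicted ratio for p = 1800 is roughly 0.18, which can be compared with the measured figure of about 25% reported in the abstract. The ratio shrinks as p grows, which is consistent with the claim that the method remains efficient at larger scale.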
Keywords :
checkpointing; failure analysis; parallel processing; software fault tolerance; software libraries; Exaflops; HPC applications; HPL; algorithm-based fault recovery; algorithm-based fault tolerant application; algorithm-based hot replacement; exascale parallelism; fail-stop failures; high performance Linpack; high performance computing applications; stop-and-wait time; Acceleration; Algorithm design and analysis; Checkpointing; Fault tolerant systems; Prediction algorithms; Redundancy; Algorithm-Based Fault Tolerance; Exascale; High Performance Linpack
Conference_Title :
Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International
Conference_Location :
Shanghai
Print_ISBN :
978-1-4673-0975-2
DOI :
10.1109/IPDPS.2012.48