DocumentCode :
2052522
Title :
High Performance Dense Linear System Solver with Soft Error Resilience
Author :
Du, Peng ; Luszczek, Piotr ; Dongarra, Jack
Author_Institution :
Electr. Eng. & Comput. Sci. Dept., Univ. of Tennessee, Knoxville, TN, USA
fYear :
2011
fDate :
26-30 Sept. 2011
Firstpage :
272
Lastpage :
280
Abstract :
As the scale of modern high-end computing systems continues to grow rapidly, system failure has become an issue that requires a better solution than the commonly used scheme of checkpoint and restart (C/R). While hard errors have been studied extensively over the years, soft errors are still understudied, especially for modern HPC systems, and in some scientific applications C/R is not applicable to soft errors at all because of error propagation and lack of error awareness. In this work, we propose an algorithm-based fault tolerance (ABFT) scheme for a high-performance dense linear system solver with soft error resilience. By adapting a mathematical model that treats a soft error during LU factorization as a rank-one perturbation, the solution of Ax=b can be recovered with the Sherman-Morrison formula. Our contributions include extending the error model from Gaussian elimination and pairwise pivoting to LU with partial pivoting, providing a practical numerical bound for error detection, and providing a scalable checkpointing algorithm to protect the left factor that is needed for recovering x from a soft error. Experimental results on cluster systems with ScaLAPACK show that the fault tolerance functionality adds little overhead to solving the linear system and scales well on such systems.
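The recovery idea named in the abstract can be illustrated with a minimal sketch. It is not the paper's ScaLAPACK implementation: it only shows how the Sherman-Morrison formula recovers x when the solver at hand inverts a rank-one perturbation of A. The function `recover_solution`, the vectors `u` and `v`, and the injected error are hypothetical names and values chosen for illustration, the error location and magnitude are assumed to be already known, and `np.linalg.solve` stands in for reusing the computed LU factors.

```python
import numpy as np

def recover_solution(solve_perturbed, u, v, b):
    """Solve A x = b given only a solver for A_hat = A + u v^T.

    Sherman-Morrison applied to A = A_hat - u v^T gives
    A^{-1} b = y + z * (v.y) / (1 - v.z), with y = A_hat^{-1} b, z = A_hat^{-1} u.
    """
    y = solve_perturbed(b)
    z = solve_perturbed(u)
    denom = 1.0 - v @ z
    if abs(denom) < 1e-12:
        raise ZeroDivisionError("denominator is ~0; A is (nearly) singular")
    return y + z * (v @ y) / denom

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned test matrix
x_true = np.ones(n)
b = A @ x_true

# Hypothetical soft error: entry (3, 2) of the matrix is corrupted by +5.0,
# which is the rank-one perturbation u v^T with u = 5*e_3 and v = e_2.
u = np.zeros(n); u[3] = 5.0
v = np.zeros(n); v[2] = 1.0
A_hat = A + np.outer(u, v)

# Stand-in for forward/back substitution with the factors of the corrupted matrix.
solve_perturbed = lambda rhs: np.linalg.solve(A_hat, rhs)

x_wrong = solve_perturbed(b)                          # solution polluted by the error
x_fixed = recover_solution(solve_perturbed, u, v, b)  # corrected solution
```

The correction costs one extra solve with the already computed factors plus a few vector operations, which is consistent with the abstract's claim of low overhead, although the sketch makes no attempt to measure it.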
Keywords :
checkpointing; matrix decomposition; software fault tolerance; Gaussian elimination; HPC systems; LU factorization; ScaLAPACK; Sherman-Morrison formula; algorithm based fault tolerance; check pointing algorithm; checkpoint-and-restart; cluster systems; error awareness; error propagation; high end computing systems; high performance dense linear system solver; mathematical model; rank-one perturbation; silent data corruption; soft error resilience; system failure; Algorithm design and analysis; Checkpointing; Equations; Fault tolerance; Fault tolerant systems; Linear systems; Mathematical model;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Cluster Computing (CLUSTER), 2011 IEEE International Conference on
Conference_Location :
Austin, TX
Print_ISBN :
978-1-4577-1355-2
Electronic_ISBN :
978-0-7695-4516-5
Type :
conf
DOI :
10.1109/CLUSTER.2011.38
Filename :
6061145