Title :
High Performance Dense Linear System Solver with Soft Error Resilience
Author :
Du, Peng ; Luszczek, Piotr ; Dongarra, Jack
Author_Institution :
Electr. Eng. & Comput. Sci. Dept., Univ. of Tennessee, Knoxville, TN, USA
Abstract :
As the scale of modern high-end computing systems continues to grow rapidly, system failure has become an issue that requires a better solution than the commonly used checkpoint and restart (C/R) scheme. While hard errors have been studied extensively over the years, soft errors remain under-studied, especially for modern HPC systems, and in some scientific applications C/R is not applicable to soft errors at all because of error propagation and lack of error awareness. In this work, we propose an algorithm-based fault tolerance (ABFT) scheme for a high performance dense linear system solver with soft error resilience. By adapting a mathematical model that treats a soft error during LU factorization as a rank-one perturbation, the solution of Ax=b can be recovered with the Sherman-Morrison formula. Our contributions include extending the error model from Gaussian elimination and pairwise pivoting to LU with partial pivoting, a practical numerical bound for error detection, and a scalable checkpointing algorithm to protect the left factor that is needed for recovering x from a soft error. Experimental results on cluster systems with ScaLAPACK show that the fault tolerance functionality adds little overhead to the linear system solve and scales well on such systems.
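The recovery idea in the abstract can be illustrated with a minimal sketch. It assumes the soft error has already been detected and located, so that the true matrix can be written as A = B + u v^T, where B is the corrupted matrix whose factorization is available. The names `solve`, `dot`, and `sherman_morrison_solve`, the 3x3 example data, and the single-entry error are all hypothetical; the paper's error detection, its numerical bound, and its ScaLAPACK checkpointing are not reproduced here.

```python
def solve(M, rhs):
    """Solve M x = rhs by Gaussian elimination with partial pivoting."""
    n = len(M)
    # Work on an augmented copy so the caller's data is left untouched.
    aug = [[float(val) for val in row] + [float(r)] for row, r in zip(M, rhs)]
    for k in range(n):
        # Partial pivoting: bring the largest entry of column k to the diagonal.
        p = max(range(k, n), key=lambda i: abs(aug[i][k]))
        aug[k], aug[p] = aug[p], aug[k]
        for i in range(k + 1, n):
            factor = aug[i][k] / aug[k][k]
            for j in range(k, n + 1):
                aug[i][j] -= factor * aug[k][j]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def sherman_morrison_solve(B, u, v, b):
    """Solve (B + u v^T) x = b using only solves with B."""
    y = solve(B, b)  # y = B^{-1} b
    z = solve(B, u)  # z = B^{-1} u
    denom = 1.0 + dot(v, z)
    if abs(denom) < 1e-12:
        raise ValueError("B + u v^T is singular or nearly singular")
    scale = dot(v, y) / denom
    return [yi - scale * zi for yi, zi in zip(y, z)]


# Hypothetical example: the true matrix A and right-hand side b.
A = [[4.0, 1.0, 2.0],
     [1.0, 3.0, 0.0],
     [2.0, 0.0, 5.0]]
b = [7.0, 4.0, 7.0]

# Assumed soft error: entry (1, 2) was silently changed from 0.0 to 0.5.
B = [row[:] for row in A]
B[1][2] = 0.5

# A = B + u v^T with u = -0.5 * e_1 and v = e_2 (0-based indices).
u = [0.0, -0.5, 0.0]
v = [0.0, 0.0, 1.0]

x_recovered = sherman_morrison_solve(B, u, v, b)
x_reference = solve(A, b)
```

Here `x_recovered` is computed from solves with the corrupted matrix only, which mirrors why the factors of the perturbed system must be kept intact: the correction reuses them instead of refactorizing A.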
Keywords :
checkpointing; matrix decomposition; software fault tolerance; Gaussian elimination; HPC systems; LU factorization; ScaLAPACK; Sherman-Morrison formula; algorithm based fault tolerance; check pointing algorithm; checkpoint-and-restart; cluster systems; error awareness; error propagation; high end computing systems; high performance dense linear system solver; mathematical model; rank-one perturbation; silent data corruption; soft error resilience; system failure; Algorithm design and analysis; Checkpointing; Equations; Fault tolerance; Fault tolerant systems; Linear systems; Mathematical model;
Conference_Title :
Cluster Computing (CLUSTER), 2011 IEEE International Conference on
Conference_Location :
Austin, TX
Print_ISBN :
978-1-4577-1355-2
Electronic_ISBN :
978-0-7695-4516-5
DOI :
10.1109/CLUSTER.2011.38