Title :
Fail-Stop Failure Algorithm-Based Fault Tolerance for Cholesky Decomposition
Author :
Hakkarinen, Doug ; Wu, Panruo ; Chen, Zizhong
Author_Institution :
Dept. of Electr. Eng. & Comput. Sci., Colorado Sch. of Mines, Golden, CO, USA
Abstract :
Cholesky decomposition is a widely used algorithm for solving systems of linear equations with a symmetric positive definite coefficient matrix. For large matrices, it is often performed on high performance supercomputers with a large number of processors. Assuming a constant failure rate per processor, the probability of a failure occurring during execution increases approximately linearly with the number of processors. Fault-tolerant methods attempt to reduce the expected execution time by allowing recovery from failure. This paper presents an analysis and implementation of a fault-tolerant Cholesky factorization algorithm that does not require checkpointing to recover from fail-stop failures. Instead, the algorithm uses redundant data stored in an additional set of processes. Unlike previous algorithm-based approaches, it addresses fail-stop failures rather than fail-continue failures. The proposed fault tolerance scheme is incorporated into ScaLAPACK and validated on the Kraken supercomputer. Experimental results demonstrate that the overhead of this method, relative to overall runtime, decreases as the matrix size increases, and thus the method shows promise for reducing the expected runtime of Cholesky factorizations on very large matrices.
Keywords :
fault tolerant computing; matrix decomposition; multiprocessing systems; Cholesky decomposition; Kraken supercomputer; ScaLAPACK; algorithm-based fault tolerance; fail-stop failure algorithm; fail-stop failure recovery; failure probability; fault tolerance scheme; fault tolerant Cholesky factorization algorithm; high performance supercomputers; linear equation; symmetric positive definite coefficient matrix; Algorithm design and analysis; Checkpointing; Fault tolerance; Fault tolerant systems; Matrix decomposition; Program processors; Symmetric matrices; Algorithm based fault tolerance (ABFT); Cholesky decomposition; extreme-scale systems; fail-stop failures
Journal_Title :
Parallel and Distributed Systems, IEEE Transactions on
DOI :
10.1109/TPDS.2014.2320502
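Illustration :
The abstract describes recovering from fail-stop failures using redundant data instead of checkpoints. The sketch below is one way to picture that idea, not the paper's ScaLAPACK implementation: it is a serial, unblocked, pure-Python toy that assumes the redundant data is a single checksum row (the column sums of the matrix) carried through a right-looking Cholesky factorization. The names `checksum_cholesky` and `recover_row` are hypothetical, and the "failure" is simulated by discarding one row of the finished factor.

```python
import math


def checksum_cholesky(a):
    """Factor the SPD matrix `a` (a list of rows) while carrying a checksum row.

    Returns (l, checksum): l is the lower-triangular Cholesky factor and
    checksum[j] is the sum of column j of l.
    """
    n = len(a)
    # Working copy of A plus one extra row holding the column sums of A.
    w = [[float(x) for x in row] for row in a]
    w.append([sum(a[i][j] for i in range(n)) for j in range(n)])
    for j in range(n):
        w[j][j] = math.sqrt(w[j][j])
        # Scale column j below the diagonal; row n is the checksum row.
        for i in range(j + 1, n + 1):
            w[i][j] /= w[j][j]
        # Right-looking update of the trailing columns, checksum row included.
        for k in range(j + 1, n):
            for i in range(k, n + 1):
                w[i][k] -= w[i][j] * w[k][j]
    # Keep only the lower triangle of the first n rows as the factor.
    l = [[w[i][j] if j <= i else 0.0 for j in range(n)] for i in range(n)]
    return l, w[n]


def recover_row(l, checksum, lost):
    """Rebuild row `lost` of l from the checksum row and the surviving rows."""
    n = len(l)
    return [checksum[j] - sum(l[i][j] for i in range(n) if i != lost)
            for j in range(n)]


a = [[4, 2, 2], [2, 5, 3], [2, 3, 6]]
l, checksum = checksum_cholesky(a)
damaged = [row[:] for row in l]
damaged[1] = [None, None, None]  # simulate losing the process that held row 1
restored = recover_row(damaged, checksum, 1)
```

Because the checksum row goes through the same scaling and trailing updates as the data rows, it ends up equal to the column sums of the factor, so any single lost row can be rebuilt by subtraction. A distributed version would additionally need to keep this relationship intact at every intermediate step of a blocked factorization, which this toy does not attempt.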