Title :
Fault tolerant matrix operations using checksum and reverse computation
Author :
Kim, Youngbae ; Plank, James S. ; Dongarra, Jack J.
Author_Institution :
Dept. of Comput. Sci., Tennessee Univ., Knoxville, TN, USA
Abstract :
In this paper, we present a technique, based on checksum and reverse computation, that enables high-performance matrix operations to be fault-tolerant with low overhead. We have implemented this technique on five matrix operations: matrix multiplication, Cholesky factorization, LU factorization, QR factorization and Hessenberg reduction. The overhead of checkpointing and recovery is analyzed both theoretically and experimentally. These analyses confirm that our technique can provide fault tolerance for these high-performance matrix operations with low overhead.
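The checksum idea named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single lost row of the product, uses integer data to sidestep roundoff, and omits the reverse-computation step entirely. The helper names `matmul`, `with_checksum_row`, and `recover_row` are hypothetical and chosen only for this illustration.

```python
# Illustrative sketch only: checksum-based recovery of one lost row of C = A*B.
# All function names are hypothetical; the record does not specify an API.

def matmul(a, b):
    """Plain triple-loop product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def with_checksum_row(a):
    """Append one extra row holding the column-wise sum of all rows of a."""
    return a + [[sum(col) for col in zip(*a)]]


def recover_row(c_full, lost):
    """Rebuild data row `lost` as the checksum row minus the surviving rows."""
    checksum = c_full[-1]
    survivors = [row for i, row in enumerate(c_full[:-1]) if i != lost]
    return [checksum[j] - sum(row[j] for row in survivors)
            for j in range(len(checksum))]


A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8], [9, 10]]

# Multiplying the checksum-augmented A by B yields C plus a checksum row of C.
C_full = matmul(with_checksum_row(A), B)
expected = matmul(A, B)

lost = 1                  # assume the holder of row 1 failed
C_full[lost] = None       # its data is gone
recovered = recover_row(C_full, lost)
```

Because the checksum row is carried through the multiplication, it stays consistent with the data rows, so any single lost row can be rebuilt by subtraction. With floating-point data the rebuilt row would match only up to roundoff error.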
Keywords :
digital arithmetic; fault tolerant computing; matrix algebra; matrix multiplication; roundoff errors; Cholesky factorization; Hessenberg reduction; LU factorization; QR factorization; checkpointing; checksum; fault tolerant matrix operations; high-performance matrix operations; recovery; reverse computation; Availability; Checkpointing; Computer science; Fault tolerance; High performance computing; Lifting equipment; Linear programming; Performance analysis; Roundoff errors; Workstations
Conference_Title :
Frontiers of Massively Parallel Computing, 1996. Proceedings Frontiers '96., Sixth Symposium on the
Conference_Location :
Annapolis, MD
Print_ISBN :
0-8186-7551-9
DOI :
10.1109/FMPC.1996.558063