DocumentCode
2043138
Title
Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile resources
Author
Chen, Zizhong ; Dongarra, Jack
Author_Institution
Dept. of Comput. Sci., Tennessee Univ., Knoxville, TN
fYear
2006
fDate
25-29 April 2006
Abstract
As the size of today's high performance computers increases from hundreds to thousands and even tens of thousands of processors, node failures in these computers are becoming frequent events. Although checkpoint/rollback-recovery is the typical technique to tolerate such failures, it often introduces a considerable overhead. Algorithm-based fault tolerance is a very cost-effective method to incorporate fault tolerance into matrix computations. However, previous algorithm-based fault tolerance methods for matrix computations are often derived using algorithms that are seldom used in the practice of today's high performance matrix computations and have mostly focused on platforms where failed processors produce incorrect calculations. To fill this gap, this paper extends the existing algorithm-based fault tolerance to the volatile computing platform where the failed processor stops working and applies it to scalable high performance matrix computations with two-dimensional block cyclic data distribution. We show the practicality of this technique by applying it to the ScaLAPACK/PBLAS matrix-matrix multiplication kernel. Experimental results demonstrate that the proposed approach is able to survive process failures with a very low performance overhead.
Keywords
checkpointing; fault tolerant computing; matrix multiplication; parallel algorithms; PBLAS; ScaLAPACK; algorithm-based fault tolerance; block cyclic data distribution; checkpoint-free fault tolerance; failed processor; matrix-matrix multiplication kernel; parallel matrix computation; performance overhead; process failure; scalable high performance matrix computation; volatile computing platform; volatile resource; Application software; Checkpointing; Computer science; Concurrent computing; Distributed computing; Fault tolerance; Fault tolerant systems; High performance computing; Kernel; Laboratories
fLanguage
English
Publisher
IEEE
Conference_Titel
Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International
Conference_Location
Rhodes Island
Print_ISBN
1-4244-0054-6
Type
conf
DOI
10.1109/IPDPS.2006.1639333
Filename
1639333