Title :
Extending algorithm-based fault tolerance to tolerate fail-stop failures in high performance distributed environments
Author_Institution :
MCIS Dept., Jacksonville State Univ., Jacksonville, AL
Abstract :
Previous work on algorithm-based fault tolerance has proved that, for matrix-matrix multiplication, the checksum relationship in the input checksum matrices is preserved at the end of the computation no matter which algorithm is used. However, whether this checksum relationship is maintained in the middle of the computation has remained an open question. In this paper, we first demonstrate that the checksum relationship is not maintained in the middle of the computation for most matrix-matrix multiplication algorithms. We then prove that, for the outer product version of the matrix-matrix multiplication algorithm, the checksum relationship is maintained in the middle of the computation. Based on this result, we demonstrate that fail-stop process failures (which are often tolerated by checkpointing or message logging) in the outer product version of matrix-matrix multiplication can be tolerated without checkpointing or message logging.
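The checksum relationship described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: it is a minimal single-process illustration, assuming the usual algorithm-based fault tolerance encoding in which A is extended with a row of column sums and B with a column of row sums. All function names are hypothetical, and the "lost row" stands in for the data of a failed process.

```python
def column_checksum(A):
    """Return A with one extra row holding the sum of each column."""
    return [row[:] for row in A] + [[sum(col) for col in zip(*A)]]


def row_checksum(B):
    """Return B with one extra column holding the sum of each row."""
    return [row[:] + [sum(row)] for row in B]


def is_full_checksum(C):
    """Check that the last row and last column of C hold the sums of the rest."""
    body = C[:-1]
    rows_ok = all(sum(r[:-1]) == r[-1] for r in C)
    cols_ok = all(sum(r[j] for r in body) == C[-1][j] for j in range(len(C[0])))
    return rows_ok and cols_ok


def outer_product_steps(Ac, Br):
    """Yield a copy of the partial product after each rank-1 update.

    Step k adds the outer product of column k of Ac and row k of Br.
    """
    m, n = len(Ac), len(Br[0])
    C = [[0] * n for _ in range(m)]
    for k in range(len(Br)):
        for i in range(m):
            for j in range(n):
                C[i][j] += Ac[i][k] * Br[k][j]
        yield [row[:] for row in C]


def recover_row(C, lost):
    """Rebuild data row `lost` from the checksum row and the surviving data rows."""
    survivors = [r for i, r in enumerate(C[:-1]) if i != lost]
    return [C[-1][j] - sum(r[j] for r in survivors) for j in range(len(C[0]))]


# Small integer example (hypothetical data) so that all comparisons are exact.
A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8, 9], [1, 2, 3]]
for partial in outer_product_steps(column_checksum(A), row_checksum(B)):
    assert is_full_checksum(partial)                  # holds after every step
    assert recover_row(partial, 1) == partial[1]      # a lost row is recoverable
```

Because every intermediate result in this sketch satisfies the checksum relationship, a row lost partway through can be rebuilt from the surviving rows and the checksum row, without rolling back to a checkpoint. With floating-point data the equality checks would need a tolerance.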
Keywords :
fault tolerant computing; mathematics computing; matrix algebra; checkpointing; checksum relationship; fail-stop failures; fault tolerance; high performance distributed environments; matrix matrix multiplication; message logging; Argon; Checkpointing; Distributed computing; Error correction; Fault detection; Fault tolerance; High performance computing
Conference_Title :
Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on
Conference_Location :
Miami, FL
Print_ISBN :
978-1-4244-1693-6
Electronic_ISBN :
1530-2075
DOI :
10.1109/IPDPS.2008.4536158