DocumentCode :
1681075
Title :
Extending algorithm-based fault tolerance to tolerate fail-stop failures in high performance distributed environments
Author :
Chen, Zizhong
Author_Institution :
MCIS Dept., Jacksonville State Univ., Jacksonville, AL
fYear :
2008
Firstpage :
1
Lastpage :
8
Abstract :
Previous work on algorithm-based fault tolerance has proved that, for matrix-matrix multiplication, the checksum relationship in the input checksum matrices is preserved at the end of the computation no matter which algorithm is used. However, whether this checksum relationship is also maintained in the middle of the computation has remained open. In this paper, we first demonstrate that the checksum relationship is not maintained in the middle of the computation for most matrix-matrix multiplication algorithms. We then prove that, for the outer-product version of matrix-matrix multiplication, the checksum relationship is maintained in the middle of the computation. Based on this property, we demonstrate that fail-stop process failures (which are often tolerated by checkpointing or message logging) in outer-product matrix-matrix multiplication can be tolerated without checkpointing or message logging.
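The checksum property described in the abstract can be illustrated with a small single-process sketch. It is not the paper's distributed implementation: the function names are hypothetical, the matrices are small integer lists, and a "failure" is modeled simply as losing one row of the partial result. Under those assumptions, it shows that every intermediate result of the outer-product algorithm is a full checksum matrix, so a lost row can be rebuilt from the checksum row at any step.

```python
def column_checksum(A):
    """Return A with an extra last row holding the sum of each column."""
    return [row[:] for row in A] + [[sum(col) for col in zip(*A)]]

def row_checksum(B):
    """Return B with an extra last column holding the sum of each row."""
    return [row[:] + [sum(row)] for row in B]

def is_full_checksum(C):
    """True if the last column holds row sums and the last row holds column sums."""
    rows_ok = all(row[-1] == sum(row[:-1]) for row in C)
    cols_ok = all(C[-1][j] == sum(r[j] for r in C[:-1]) for j in range(len(C[0])))
    return rows_ok and cols_ok

def outer_product_steps(Ac, Br):
    """Yield a copy of the partial product after each rank-1 update."""
    m, n, p = len(Ac), len(Br[0]), len(Br)
    C = [[0] * n for _ in range(m)]
    for s in range(p):
        # Add the outer product of column s of Ac and row s of Br.
        for i in range(m):
            for j in range(n):
                C[i][j] += Ac[i][s] * Br[s][j]
        yield [row[:] for row in C]

def recover_row(C, lost):
    """Rebuild data row `lost` from the checksum row and the surviving data rows."""
    survivors = [C[i] for i in range(len(C) - 1) if i != lost]
    return [C[-1][j] - sum(r[j] for r in survivors) for j in range(len(C[0]))]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
partials = list(outer_product_steps(column_checksum(A), row_checksum(B)))

# The checksum relationship holds after every step, not only at the end.
all_valid = all(is_full_checksum(C) for C in partials)

# Simulate losing row 1 of the first intermediate result and rebuild it.
damaged = [row[:] for row in partials[0]]
damaged[1] = [None] * len(damaged[1])
rebuilt = recover_row(damaged, lost=1)
```

Here `rebuilt` equals row 1 of `partials[0]`, and the data part of `partials[-1]` is the ordinary product of `A` and `B`. Integer entries are used so the checksum comparison is exact; with floating-point data the equality test would need a tolerance.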
Keywords :
fault tolerant computing; mathematics computing; matrix algebra; checkpointing; checksum relationship; fail-stop failures; fault tolerance; high performance distributed environments; matrix matrix multiplication; message logging; Argon; Checkpointing; Distributed computing; Error correction; Fault detection; Fault tolerance; High performance computing;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on
Conference_Location :
Miami, FL
ISSN :
1530-2075
Print_ISBN :
978-1-4244-1693-6
Electronic_ISBN :
1530-2075
Type :
conf
DOI :
10.1109/IPDPS.2008.4536158
Filename :
4536158