• DocumentCode
    1681075
  • Title
    Extending algorithm-based fault tolerance to tolerate fail-stop failures in high performance distributed environments
  • Author
    Chen, Zizhong
  • Author_Institution
    MCIS Dept., Jacksonville State Univ., Jacksonville, AL
  • fYear
    2008
  • Firstpage
    1
  • Lastpage
    8
  • Abstract
    It has been proved in previous work on algorithm-based fault tolerance that, for matrix-matrix multiplication, the checksum relationship in the input checksum matrices is preserved at the end of the computation no matter which algorithm is used. However, whether this checksum relationship is also maintained in the middle of the computation has remained open. In this paper, we first demonstrate that this checksum relationship is not maintained in the middle of the computation for most matrix-matrix multiplication algorithms. We then prove that, for the outer product version of matrix-matrix multiplication, the checksum relationship is maintained in the middle of the computation. Based on this property, we demonstrate that fail-stop process failures (which are often tolerated by checkpointing or message logging) in the outer product version of matrix-matrix multiplication can be tolerated without checkpointing or message logging.
  • Keywords
    fault tolerant computing; mathematics computing; matrix algebra; checkpointing; checksum relationship; fail-stop failures; fault tolerance; high performance distributed environments; matrix matrix multiplication; message logging; Argon; Checkpointing; Distributed computing; Error correction; Fault detection; Fault tolerance; High performance computing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on
  • Conference_Location
    Miami, FL
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-4244-1693-6
  • Electronic_ISBN
    1530-2075
  • Type
    conf
  • DOI
    10.1109/IPDPS.2008.4536158
  • Filename
    4536158
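
The checksum relationship described in the abstract can be illustrated with a small sketch. It is not the paper's implementation: it is a single-process, pure-Python toy, and the helper names (`column_checksum`, `row_checksum`, `is_full_checksum`, `outer_product_steps`, `recover_row`) are hypothetical, chosen here for illustration. It assumes the usual algorithm-based fault tolerance encoding, in which A gains a row of column sums and B gains a column of row sums, and it uses integer entries so the checks are exact. After every rank-1 update of the outer product version, the partial result still satisfies the checksum relationship, so a lost row can be rebuilt from the surviving rows.

```python
def column_checksum(a):
    """Return a with one extra row holding its column sums."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def row_checksum(b):
    """Return b with one extra column holding its row sums."""
    return [row[:] + [sum(row)] for row in b]


def is_full_checksum(c):
    """True if the last column holds row sums and the last row holds column sums."""
    rows_ok = all(sum(row[:-1]) == row[-1] for row in c)
    cols_ok = all(sum(col[:-1]) == col[-1] for col in zip(*c))
    return rows_ok and cols_ok


def outer_product_steps(ac, br):
    """Yield a copy of the partial product after each rank-1 (outer product) update."""
    n_rows, n_inner, n_cols = len(ac), len(br), len(br[0])
    c = [[0] * n_cols for _ in range(n_rows)]
    for k in range(n_inner):
        for i in range(n_rows):
            for j in range(n_cols):
                c[i][j] += ac[i][k] * br[k][j]
        yield [row[:] for row in c]


def recover_row(c, lost):
    """Rebuild data row `lost` from the checksum row and the other data rows."""
    last = len(c) - 1  # index of the checksum row
    return [
        c[last][j] - sum(c[i][j] for i in range(last) if i != lost)
        for j in range(len(c[0]))
    ]


# Assumed example inputs (not taken from the paper).
a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

for partial in outer_product_steps(column_checksum(a), row_checksum(b)):
    # The relationship holds after every update, not only at the end.
    assert is_full_checksum(partial)
    # Pretend row 0 was lost in a fail-stop failure and rebuild it.
    assert recover_row(partial, 0) == partial[0]
```

In a distributed setting the rows would live on different processes and the rebuild would involve communication among the survivors; this sketch only shows the arithmetic that makes recovery possible without a checkpoint.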