• DocumentCode
    1135160
  • Title

    Algorithm-Based Fault Tolerance for Fail-Stop Failures

  • Author

    Chen, Zizhong ; Dongarra, Jack

  • Author_Institution
    Dept. of Math. & Comput. Sci., Colorado Sch. of Mines, Golden, CO
  • Volume
    19
  • Issue
    12
  • fYear
    2008
  • Firstpage
    1628
  • Lastpage
    1641
  • Abstract
    Fail-stop failures in distributed environments are often tolerated by checkpointing or message logging. In this paper, we show that fail-stop process failures in the ScaLAPACK matrix-matrix multiplication kernel can be tolerated without checkpointing or message logging. It has been proved in previous algorithm-based fault tolerance research that, for matrix-matrix multiplication, the checksum relationship in the input checksum matrices is preserved at the end of the computation no matter which algorithm is chosen. From this checksum relationship in the final computation results, processor miscalculations can be detected, located, and corrected at the end of the computation. However, whether this checksum relationship can be maintained in the middle of the computation remains open. In this paper, we first demonstrate that, for many matrix-matrix multiplication algorithms, the checksum relationship in the input checksum matrices is not maintained in the middle of the computation. We then prove that, for the outer product version of the algorithm, the checksum relationship in the input checksum matrices can be maintained in the middle of the computation. Based on this checksum relationship maintained in the middle of the computation, we demonstrate that fail-stop process failures (which are often tolerated by checkpointing or message logging) in ScaLAPACK matrix-matrix multiplication can be tolerated without checkpointing or message logging.
  • Keywords
    checkpointing; distributed processing; matrix multiplication; software fault tolerance; ScaLAPACK; checksum matrix; checksum relationship; distributed environment; fail-stop failure; fault tolerance; matrix-matrix multiplication; processor miscalculation; product version algorithm; Mathematical Software; Parallel algorithms; Reliability and robustness
  • fLanguage
    English
  • Journal_Title
    Parallel and Distributed Systems, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1045-9219
  • Type

    jour

  • DOI
    10.1109/TPDS.2008.58
  • Filename
    4492768
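
The abstract's central claim, that the outer product version keeps the checksum relationship intact in the middle of the computation, can be illustrated with a minimal single-process sketch. It is not the paper's ScaLAPACK implementation: the function names (`column_checksum`, `row_checksum`, `outer_product_steps`, `is_full_checksum`, `recover_row`) are hypothetical, the matrices are small integer lists, and a "failure" is simulated by rebuilding one row of a partial result from the checksum row.

```python
def column_checksum(a):
    """Append a row holding each column's sum (A -> column checksum matrix)."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def row_checksum(b):
    """Append a column holding each row's sum (B -> row checksum matrix)."""
    return [row[:] + [sum(row)] for row in b]


def is_full_checksum(c):
    """True if the last row and last column of c hold the sums of the others."""
    rows_ok = all(row[-1] == sum(row[:-1]) for row in c)
    cols_ok = all(
        c[-1][j] == sum(r[j] for r in c[:-1]) for j in range(len(c[0]))
    )
    return rows_ok and cols_ok


def outer_product_steps(ac, br):
    """Yield a copy of the partial product after each rank-1 (outer product) update."""
    m, n, inner = len(ac), len(br[0]), len(br)
    c = [[0] * n for _ in range(m)]
    for s in range(inner):
        for i in range(m):
            for j in range(n):
                c[i][j] += ac[i][s] * br[s][j]
        yield [row[:] for row in c]


def recover_row(c, lost):
    """Rebuild data row `lost` from the checksum row and the surviving data rows."""
    others = [r for i, r in enumerate(c[:-1]) if i != lost]
    return [c[-1][j] - sum(r[j] for r in others) for j in range(len(c[0]))]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
partials = list(outer_product_steps(column_checksum(a), row_checksum(b)))

# Every intermediate result, not only the final one, is a full checksum matrix.
assert all(is_full_checksum(p) for p in partials)

# Simulate losing data row 1 after the first update and rebuild it.
mid = partials[0]
assert recover_row(mid, 1) == mid[1]
```

Each rank-1 update is itself a full checksum matrix, and a sum of such matrices is one as well, which is why the check passes after every step in this sketch. In the distributed setting the abstract describes, the same idea would apply to the data held by a failed process rather than to a single row.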