• DocumentCode
    3335453
  • Title
    Algorithm-based diskless checkpointing for fault tolerant matrix operations
  • Author
    Plank, J.S. ; Youngbae Kim ; Dongarra, J.J.
  • Author_Institution
    Dept. of Comput. Sci., Tennessee Univ., TN, USA
  • fYear
    1995
  • fDate
    27-30 June 1995
  • Firstpage
    351
  • Lastpage
    360
  • Abstract
    The paper is an exploration of diskless checkpointing for distributed scientific computations. With the widespread use of the "network of workstations" (NOW) platform for distributed computing, long-running scientific computations need to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several algorithms for distributed scientific computing, including Cholesky factorization, LU factorization, QR factorization, and preconditioned conjugate gradient. These implementations are able to run on PVM networks of at least N processors, and can complete with low overhead as long as any N processors remain functional. We discuss the details of how the algorithms are tuned for fault-tolerance, and present the performance results on a PVM network of SUN workstations, and on the IBM SP2.
  • Keywords
    conjugate gradient methods; local area networks; matrix algebra; natural sciences computing; software fault tolerance; subroutines; workstations; Cholesky factorization; IBM SP2; LU factorization; PVM networks; QR factorization; SUN workstations; algorithm-based diskless checkpointing; distributed scientific computations; fault tolerant matrix operations; fault-tolerance; high-performance implementations; long-running scientific computations; low overhead; performance; preconditioned conjugate gradient; processors; workstation network platform; Algorithms; Checkpointing; Computer science; Contracts; Distributed computing; Fault tolerance; Lifting equipment; Scientific computing; Supercomputers; Workstations;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Fault-Tolerant Computing, 1995. FTCS-25. Digest of Papers., Twenty-Fifth International Symposium on
  • Conference_Location
    Pasadena, CA, USA
  • Print_ISBN
    0-8186-7079-7
  • Type
    conf
  • DOI
    10.1109/FTCS.1995.466964
  • Filename
    466964