DocumentCode :
1867789
Title :
Fault tolerant matrix operations for networks of workstations using multiple checkpointing
Author :
Kim, Youngbae ; Plank, James S. ; Dongarra, Jack J.
Author_Institution :
Lawrence Berkeley Nat. Lab., California Univ., Berkeley, CA, USA
fYear :
1997
fDate :
28 Apr-2 May 1997
Firstpage :
460
Lastpage :
465
Abstract :
Recently, an algorithm-based approach using diskless checkpointing has been developed to provide fault tolerance for high-performance matrix operations. With this approach, fault tolerance is incorporated into the matrix operations, making them resilient to any single process failure with low overhead. In this paper, we present a technique called multiple checkpointing that enables the matrix operations to tolerate a certain set of multiple processor failures by adding multiple checkpointing processors. Results of implementing this technique on a network of workstations show improvement in both the reliability of the computation and the performance of checkpointing.
Keywords :
data integrity; distributed processing; mathematics computing; matrix algebra; software fault tolerance; workstations; algorithm-based approach; computational reliability; diskless checkpointing; fault-tolerant matrix operations; high-performance matrix operations; multiple checkpointing; multiple processor failures; overhead; performance; process failure; workstation networks; Availability; Checkpointing; Computer networks; Computer science; Contracts; Encoding; Fault tolerance; Laboratories; Lifting equipment; Workstations;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
High Performance Computing on the Information Superhighway, 1997. HPC Asia '97
Conference_Location :
Seoul
Print_ISBN :
0-8186-7901-8
Type :
conf
DOI :
10.1109/HPC.1997.592191
Filename :
592191