• DocumentCode
    2499377
  • Title

    Improving the performance of coordinated checkpointers on networks of workstations using RAID techniques

  • Author

    Plank, James S.

  • Author_Institution
    Dept. of Comput. Sci., Tennessee Univ., Knoxville, TN, USA
  • fYear
    1996
  • fDate
    23-25 Oct 1996
  • Firstpage
    76
  • Lastpage
    85
  • Abstract
    Coordinated checkpointing systems are popular and general-purpose tools for implementing process migration, coarse-grained job swapping, and fault-tolerance on networks of workstations. Though simple in concept, there are several design decisions concerning the placement of checkpoint files that can impact the performance and functionality of coordinated checkpointers. Although several such checkpointers have been implemented for popular programming platforms like PVM and MPI, none have taken this issue into consideration. This paper addresses the issue of checkpoint placement and its impact on the performance and functionality of coordinated checkpointing systems. Several strategies, both old and new, are described and implemented on a network of SPARC-5 workstations running PVM. These strategies range from very simple to more complex, borrowing heavily from ideas in RAID (Redundant Arrays of Inexpensive Disks) fault-tolerance. The results of this paper will serve as a guide so that future implementations of coordinated checkpointing can allow their users to achieve the combination of performance and functionality that is right for their applications.
  • Keywords
    computer networks; fault tolerant computing; performance evaluation; workstations; RAID techniques; SPARC-5 workstations; coarse-grained job swapping; coordinated checkpointers; fault-tolerance; networks of workstations; performance; process migration; workstations; Checkpointing; Clocks; Computer science; Fault tolerance; Fault tolerant systems; History; Marketing and sales; Optimization methods; Parallel programming; Workstations;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Reliable Distributed Systems, 1996. Proceedings., 15th Symposium on
  • Conference_Location
    Niagara-on-the-Lake, Ont.
  • ISSN
    1060-9857
  • Print_ISBN
    0-8186-7481-4
  • Type

    conf

  • DOI
    10.1109/RELDIS.1996.559700
  • Filename
    559700