• DocumentCode
    3172113
  • Title
    Experimental assessment of workstation failures and their impact on checkpointing systems
  • Author
    Plank, J.S.; Elwasif, W.R.
  • Author_Institution
    Dept. of Comput. Sci., Tennessee Univ., Knoxville, TN, USA
  • fYear
    1998
  • fDate
    23-25 June 1998
  • Firstpage
    48
  • Lastpage
    57
  • Abstract
    In the past twenty years, there has been a wealth of theoretical research on minimizing the expected running time of a program in the presence of failures by employing checkpointing and rollback recovery. In the same time period, there has been little experimental research to corroborate these results. We study three separate projects that monitor failure in workstation networks. Our goals are twofold. The first is to see how these results correlate with the theoretical results, and the second is to assess their impact on strategies for checkpointing long-running computations on workstations and networks of workstations. A significant result of our work is that although the base assumptions of the theoretical research do not hold, many of the results are still applicable.
  • Keywords
    computer network reliability; distributed processing; fault tolerant computing; local area networks; system recovery; workstations; checkpointing systems; local area network; program running time; rollback recovery; workstation failure; workstation network failure; Checkpointing; Computer networks; Computer science; Condition monitoring; Equations; Failure analysis; Probability distribution; Supercomputers; Utility programs; Workstations
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Fault-Tolerant Computing, 1998. Digest of Papers. Twenty-Eighth Annual International Symposium on
  • Conference_Location
    Munich, Germany
  • ISSN
    0731-3071
  • Print_ISBN
    0-8186-8470-4
  • Type
    conf
  • DOI
    10.1109/FTCS.1998.689454
  • Filename
    689454