• DocumentCode
    3013859
  • Title

    The average availability of parallel checkpointing systems and its importance in selecting runtime parameters

  • Author

    Plank, J.S. ; Thomason, M.G.

  • Author_Institution
    Dept. of Comput. Sci., Tennessee Univ., Knoxville, TN, USA
  • fYear
    1999
  • fDate
    15-18 June 1999
  • Firstpage
    250
  • Lastpage
    257
  • Abstract
    Performance prediction of checkpointing systems in the presence of failures is a well-studied research area. While the literature abounds with performance models of checkpointing systems, none address the issue of selecting runtime parameters other than the optimal checkpointing interval. In particular, the issue of processor allocation is typically ignored. In this paper we briefly present a performance model for long-running parallel computations that execute with checkpointing enabled. We then discuss how it is relevant to today's parallel computing environments and software, and present case studies of using the model to select runtime parameters.
  • Keywords
    parallel programming; software fault tolerance; software performance evaluation; checkpointing systems; parallel checkpointing systems; parallel computing; performance models; processor allocation; runtime parameters; Checkpointing; Computer science; Distributed computing; Electrical capacitance tomography; Electronic switching systems; Parallel processing; Runtime;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Fault-Tolerant Computing, 1999. Digest of Papers. Twenty-Ninth Annual International Symposium on
  • Conference_Location
    Madison, WI, USA
  • ISSN
    0731-3071
  • Print_ISBN
    0-7695-0213-X
  • Type

    conf

  • DOI
    10.1109/FTCS.1999.781059
  • Filename
    781059