• DocumentCode
    3248243
  • Title
    Environmental-aware optimization of MPI checkpointing intervals
  • Author
    Jitsumoto, Hideyuki ; Endo, Toshio ; Matsuoka, Satoshi
  • Author_Institution
    Tokyo Inst. of Technol., Tokyo
  • fYear
    2008
  • fDate
    Sept. 29 2008-Oct. 1 2008
  • Firstpage
    326
  • Lastpage
    329
  • Abstract
    Fault-tolerance for HPC systems with long-running applications of massive and growing scale is now essential. Although checkpointing with rollback recovery is a popular technique, automated checkpointing is becoming troublesome in a real system, due to the extremely large size of collective application memory. Therefore, automated optimization of the checkpoint interval is essential, but the optimal point depends on hardware failure rates and I/O bandwidth. Our new model and algorithm, which extend Vaidya's model, solve the problem by taking such parameters into account. A prototype implementation on our fault-tolerant MPI framework ABARIS showed approximately 5.5% improvement over statically user-determined cases.
  • Keywords
    checkpointing; fault tolerant computing; message passing; optimisation; HPC systems; MPI checkpointing intervals; Vaidya model; collective application memory; environmental-aware optimization; fault-tolerance; rollback recovery; Bandwidth; Checkpointing; Cost function; Design optimization; Exponential distribution; Fault tolerance; Fault tolerant systems; Informatics; Prototypes; Supercomputers;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Cluster Computing, 2008 IEEE International Conference on
  • Conference_Location
    Tsukuba
  • ISSN
    1552-5244
  • Print_ISBN
    978-1-4244-2639-3
  • Electronic_ISBN
    1552-5244
  • Type
    conf
  • DOI
    10.1109/CLUSTR.2008.4663790
  • Filename
    4663790