• DocumentCode
    1684587
  • Title
    Scalable group-based checkpoint/restart for large-scale message-passing systems
  • Author
    Ho, Justin C Y ; Wang, Cho-Li ; Lau, Francis C M
  • Author_Institution
    Dept. of Comput. Sci., Univ. of Hong Kong, Hong Kong
  • fYear
    2008
  • Firstpage
    1
  • Lastpage
    12
  • Abstract
    The ever-increasing number of processors used in parallel computers is making fault tolerance support in large-scale parallel systems more and more important. We discuss the inadequacies of existing system-level checkpointing solutions for message-passing applications as the system scales up. We analyze the coordination cost and blocking behavior of two current MPI implementations with checkpointing support. A group-based solution combining coordinated checkpointing and message logging is then proposed. Experimental results demonstrate that it performs and scales better than LAM/MPI and MPICH-VCL. To assist group formation, a method for analyzing the communication behavior of the application is also proposed.
  • Keywords
    checkpointing; message passing; software fault tolerance; MPI; fault tolerance; large-scale message-passing systems; message logging; parallel computers; scalable group-based checkpoint-restart; system-level checkpointing; Application software; Checkpointing; Computer science; Concurrent computing; Costs; Failure analysis; Fault tolerance; Fault tolerant systems; Large-scale systems; Scalability
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on
  • Conference_Location
    Miami, FL
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-4244-1693-6
  • Electronic_ISBN
    1530-2075
  • Type
    conf
  • DOI
    10.1109/IPDPS.2008.4536302
  • Filename
    4536302