• DocumentCode
    2792546
  • Title
    The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI
  • Author
    Hursey, Joshua ; Squyres, Jeffrey M. ; Mattox, Timothy I. ; Lumsdaine, Andrew
  • Author_Institution
    Open Syst. Lab., Indiana Univ., Bloomington, IN
  • fYear
    2007
  • fDate
    26-30 March 2007
  • Firstpage
    1
  • Lastpage
    8
  • Abstract
    To be able to fully exploit ever larger computing platforms, modern HPC applications and system software must be able to tolerate inevitable faults. Historically, MPI implementations that incorporated fault tolerance capabilities have been limited by lack of modularity, scalability and usability. This paper presents the design and implementation of an infrastructure to support checkpoint/restart fault tolerance in the Open MPI project. We identify the general capabilities required for distributed checkpoint/restart and realize these capabilities as extensible frameworks within Open MPI's modular component architecture. Our design features an abstract interface for providing and accessing fault tolerance services without sacrificing performance, robustness, or flexibility. Although our implementation includes support for some initial checkpoint/restart mechanisms, the framework is meant to be extensible and to encourage experimentation with alternative techniques within a production quality MPI implementation.
  • Keywords
    application program interfaces; checkpointing; message passing; software architecture; software fault tolerance; software portability; Open MPI; checkpoint-restart process fault tolerance; modular component architecture; production quality MPI; system software; Application software; Fault tolerance; Fault tolerant systems; Laboratories; Libraries; Message passing; Open systems; Platform virtualization; Production; Robustness
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International
  • Conference_Location
    Long Beach, CA
  • Print_ISBN
    1-4244-0910-1
  • Electronic_ISBN
    1-4244-0910-1
  • Type
    conf
  • DOI
    10.1109/IPDPS.2007.370605
  • Filename
    4228333