• DocumentCode
    3424051
  • Title

    User-triggered checkpointing: system-independent and scalable application recovery

  • Author

    Deconinck, Geert ; Lauwereins, Rudy

  • Author_Institution
    Electrotech. Dept., Katholieke Univ., Leuven, Heverlee, Belgium
  • fYear
    1997
  • fDate
    1-3 Jul 1997
  • Firstpage
    418
  • Lastpage
    423
  • Abstract
    User-triggered checkpointing and rollback is proposed as a system-independent and flexible way to integrate backward error recovery in long-running, computation-intensive message-passing applications on large parallel multicomputers. It employs library calls to coordinate the checkpointing, allowing a non-blocking and scalable approach that requires no protocol to save a consistent state because the coordination among the processes is implicit. The explicit indication of the checkpoint contents (i.e. the items of which the state must be saved) allows one to significantly reduce the amount of checkpoint data and the overhead. In contrast to other checkpointing approaches, the implementation does not rely on system-dependent features (like saving register-values or communication status) to save the state. Instead, re-executing the first part of the application brings the system-specific items into a consistent state with the rest of the checkpoint contents that is restored from the saved checkpoint data.
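
    The recovery scheme described in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' library, whose interface this record does not describe: the names `save_checkpoint`, `load_checkpoint`, and `run`, the use of `pickle`, and the checkpoint interval of 10 steps are all assumptions made for illustration, and message passing between processes is left out entirely. The sketch shows only two ideas: the application names the items to save explicitly, and after a rollback the first part of the program is re-executed before the saved items are restored.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, contents):
    """Write the explicitly chosen items (hypothetical helper, not the paper's API)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(contents, f)
    # Rename into place so a crash never leaves a half-written checkpoint.
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved items, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n_steps, fail_at=None):
    # First part of the application: always re-executed, also after a
    # rollback. It stands in for the system-specific setup that is
    # rebuilt by re-execution rather than saved.
    state = {"step": 0, "total": 0}

    # Restore only the items that were explicitly checkpointed.
    saved = load_checkpoint(path)
    if saved is not None:
        state.update(saved)

    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        # User-triggered checkpoint: the application decides when to save
        # and which items to save (interval of 10 is an assumption).
        if state["step"] % 10 == 0:
            save_checkpoint(path, {"step": state["step"], "total": state["total"]})
    return state["total"]
```

    Calling `run` once with `fail_at` set and then again without it resumes from the last saved step instead of from the beginning, and only the two named items are ever written to disk.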
  • Keywords
    fault tolerant computing; message passing; parallel machines; system recovery; backward error recovery; checkpoint contents; checkpoint data; message-passing applications; nonblocking approach; overhead reduction; parallel multicomputers; rollback; scalable application recovery; scalable approach; system-independent recovery; user-triggered checkpointing; Application software; Checkpointing; Computational modeling; Computer applications; Concurrent computing; High performance computing; Libraries; Power system modeling; Power system restoration; Protocols;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computers and Communications, 1997. Proceedings., Second IEEE Symposium on
  • Conference_Location
    Alexandria
  • Print_ISBN
    0-8186-7852-6
  • Type

    conf

  • DOI
    10.1109/ISCC.1997.616035
  • Filename
    616035