• DocumentCode
    438812
  • Title
    Cruz: Application-Transparent Distributed Checkpoint-Restart on Standard Operating Systems
  • Author
    Janakiraman, G. ; Santos, Jose Renato ; Subhraveti, Dinesh ; Turner, Yoshio
  • Author_Institution
    Hewlett-Packard Laboratories
  • fYear
    2005
  • fDate
    28 June-1 July 2005
  • Firstpage
    260
  • Lastpage
    269
  • Abstract
    We present a new distributed checkpoint-restart mechanism, Cruz, that works without requiring application, library, or base kernel modifications. This mechanism provides comprehensive support for checkpointing and restoring application state, both at user level and within the OS. Our implementation builds on Zap, a process migration mechanism, implemented as a Linux kernel module, which operates by interposing a thin layer between applications and the OS. In particular, we enable support for networked applications by adding migratable IP and MAC addresses, and checkpoint-restart of socket buffer state, socket options, and TCP state. We leverage this capability to devise a novel method for coordinated checkpoint-restart that is simpler than prior approaches. For instance, it eliminates the need to flush communication channels by exploiting the packet re-transmission behavior of TCP and existing OS support for packet filtering. Our experiments show that the overhead of coordinating checkpoint-restart is negligible, demonstrating the scalability of this approach.
  • Keywords
    Checkpointing; Communication channels; Filtering; Kernel; Libraries; Linux; Operating systems; Scalability; Sockets; TCP/IP
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Dependable Systems and Networks, 2005. DSN 2005. Proceedings. International Conference on
  • Print_ISBN
    0-7695-2282-3
  • Type
    conf
  • DOI
    10.1109/DSN.2005.33
  • Filename
    1467800