• DocumentCode
    2788022
  • Title
    DejaVu: Transparent User-Level Checkpointing, Migration, and Recovery for Distributed Systems
  • Author
    Ruscio, Joseph F. ; Heffner, Michael A. ; Varadarajan, Srinidhi
  • Author_Institution
    Dept. of Comput. Sci., Virginia Tech., VA
  • fYear
    2007
  • fDate
    26-30 March 2007
  • Firstpage
    1
  • Lastpage
    10
  • Abstract
    In this paper, we present a new fault tolerance system called DejaVu for transparent and automatic checkpointing, migration, and recovery of parallel and distributed applications. DejaVu provides a transparent parallel checkpointing and recovery mechanism that recovers from any combination of system failures without any modification to parallel applications or the OS. It uses a new runtime mechanism for transparent incremental checkpointing that captures the least amount of state needed to maintain global consistency and provides a novel communication architecture that enables transparent migration of existing MPI codes, without source-code modifications. Performance results from the production-ready implementation show less than 5% overhead in real-world parallel applications with large memory footprints.
  • Keywords
    application program interfaces; checkpointing; fault tolerant computing; message passing; parallel processing; system monitoring; DejaVu fault tolerance system; MPI code; communication architecture; distributed system automatic migration; distributed system automatic recovery; runtime mechanism; system failure; transparent incremental checkpointing; transparent parallel user-level checkpointing; Application software; Checkpointing; Computer networks; Computer science; Concurrent computing; Distributed computing; Fault tolerant systems; Laboratories; Runtime; Stability
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International
  • Conference_Location
    Long Beach, CA
  • Print_ISBN
    1-4244-0910-1
  • Electronic_ISBN
    1-4244-0910-1
  • Type
    conf
  • DOI
    10.1109/IPDPS.2007.370309
  • Filename
    4228037