  • DocumentCode
    625576
  • Title
    Optimizing Checkpoints Using NVM as Virtual Memory
  • Author
    Kannan, S.; Gavrilovska, Ada; Schwan, Karsten; Milojicic, D.
  • Author_Institution
    College of Computing, Georgia Institute of Technology, Atlanta, GA, USA
  • fYear
    2013
  • fDate
    20-24 May 2013
  • Firstpage
    29
  • Lastpage
    40
  • Abstract
    Rapid checkpointing will remain key functionality for next-generation high-end machines. This paper explores the use of node-local nonvolatile memories (NVM), such as phase-change memory, to provide frequent, low-overhead checkpoints. By adapting existing multi-level checkpoint techniques, we devise new methods, termed NVM-checkpoints, that efficiently store checkpoints on both local and remote node NVM. The checkpoint frequencies are guided by failure models that capture the expected accessibility of such data after failure. To lower overheads, NVM-checkpoints reduce the NVM and interconnect bandwidth used with a novel pre-copy mechanism, which incrementally moves checkpoint data from DRAM to NVM before a local checkpoint is started. This reduces local checkpoint cost by limiting the instantaneous data volume moved at checkpoint time, thereby freeing bandwidth for use by applications. In fact, the pre-copy method can reduce peak interconnect usage by up to 46%. Since our approach treats NVM as memory rather than as a 'Ramdisk', pre-copying can be generalized to directly move data to remote NVMs. This results in 40% faster application execution times compared to asynchronous approaches not using pre-copying.
  • Keywords
    DRAM chips; checkpointing; failure analysis; fault tolerance; information retrieval; optimisation; virtual machines; virtual storage; DRAM; checkpoint frequency; checkpoint optimization; data accessibility; failure model; local node NVM; multilevel checkpoint technique; next generation high end machine; nonvolatile memory; precopy mechanism; remote node NVM; virtual memory; Bandwidth; Checkpointing; Hardware; Nonvolatile memory; Peer-to-peer computing; Phase change materials; Random access memory; Memory bandwidth; Non volatile memory (NVM); PCM; Pre-Copy
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on
  • Conference_Location
    Boston, MA
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-4673-6066-1
  • Type
    conf
  • DOI
    10.1109/IPDPS.2013.69
  • Filename
    6569798
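
The pre-copy mechanism summarized in the abstract can be illustrated with a toy model. The Python sketch below is not the authors' implementation: the class name `PreCopyCheckpointer`, the fixed chunk granularity, the software dirty set, and the use of plain dictionaries as stand-ins for DRAM and node-local NVM are all assumptions made for illustration. It only shows the general idea that copying dirty chunks in the background leaves less data to move at checkpoint time.

```python
class PreCopyCheckpointer:
    """Toy model of pre-copy checkpointing (hypothetical, for illustration).

    DRAM and NVM are modeled as dicts mapping chunk id -> bytes.
    """

    def __init__(self, num_chunks, chunk_size=4096):
        self.chunk_size = chunk_size
        # Application state held in (simulated) DRAM.
        self.dram = {i: bytes(chunk_size) for i in range(num_chunks)}
        # Stand-in for node-local NVM; empty until data is copied.
        self.nvm = {}
        # Chunks whose DRAM contents are not yet reflected in NVM.
        self.dirty = set(self.dram)

    def write(self, chunk_id, data):
        """Application write: update DRAM and mark the chunk dirty."""
        if chunk_id not in self.dram:
            raise KeyError(chunk_id)
        if len(data) != self.chunk_size:
            raise ValueError("data must be exactly one chunk long")
        self.dram[chunk_id] = data
        self.dirty.add(chunk_id)

    def precopy_step(self, max_chunks):
        """Background step: copy at most max_chunks dirty chunks to NVM.

        Returns the number of bytes moved in this step.
        """
        moved = 0
        for chunk_id in sorted(self.dirty)[:max_chunks]:
            self.nvm[chunk_id] = self.dram[chunk_id]
            self.dirty.discard(chunk_id)
            moved += self.chunk_size
        return moved

    def checkpoint(self):
        """Local checkpoint: copy only the chunks that are still dirty.

        Returns the number of bytes moved at checkpoint time.
        """
        return self.precopy_step(len(self.dirty))


if __name__ == "__main__":
    cp = PreCopyCheckpointer(num_chunks=8, chunk_size=4)
    cp.write(0, b"abcd")
    background = cp.precopy_step(6)   # incremental copy before the checkpoint
    cp.write(1, b"wxyz")              # re-dirties a chunk that was already copied
    at_checkpoint = cp.checkpoint()   # only the remaining dirty chunks move now
    print("bytes moved in background:", background)
    print("bytes moved at checkpoint:", at_checkpoint)
```

In this model a chunk written after it has been pre-copied is simply marked dirty again and copied a second time, so pre-copying trades some extra total data movement for a smaller burst at checkpoint time. How the paper tracks modifications and schedules copies is not stated in this record.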