• DocumentCode
    625575
  • Title

    Towards Scalable Checkpoint Restart: A Collective Inline Memory Contents Deduplication Proposal

  • Author

    Nicolae, Bogdan

  • Author_Institution
    Exascale Syst. Group, IBM Res., Dublin, Ireland
  • fYear
    2013
  • fDate
    20-24 May 2013
  • Firstpage
    19
  • Lastpage
    28
  • Abstract
    With the increasing scale and complexity of supercomputing and cloud computing architectures, faults are becoming a frequent occurrence. For a large class of applications that run for a long time and are tightly coupled, Checkpoint-Restart (CR) is the only feasible method to survive failures. However, exploding checkpoint sizes that need to be dumped to storage pose a major scalability challenge, prompting the need to reduce the amount of checkpointing data. This paper contributes a novel collective memory contents deduplication scheme that attempts to identify and eliminate duplicate memory pages before they are saved to storage. Unlike previous approaches that concentrate on the checkpoints of the same process, our approach identifies duplicate memory pages shared by different processes (regardless of whether they are on the same or different nodes). We show both how to achieve such a global deduplication in a scalable fashion and how to leverage it effectively to optimize the data layout in such a way that it minimizes I/O bottlenecks. Large-scale experiments show a significant reduction of storage space consumption and performance overhead compared to several state-of-the-art approaches, both in synthetic benchmarks and for a real-life high performance computing application.
  • Keywords
    checkpointing; storage management; collective inline memory content deduplication; global deduplication; scalable checkpoint restart; storage space consumption; Bandwidth; Checkpointing; Computer architecture; Fault tolerance; Fault tolerant systems; Load management; Proposals; I/O load balancing; checkpoint restart; deduplication; fault tolerance; high performance computing; memory checkpointing; scientific computing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on
  • Conference_Location
    Boston, MA
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-4673-6066-1
  • Type

    conf

  • DOI
    10.1109/IPDPS.2013.14
  • Filename
    6569797