• DocumentCode
    1920416
  • Title
    On the Viability of Compression for Reducing the Overheads of Checkpoint/Restart-Based Fault Tolerance
  • Author
    Ibtesham, Dewan ; Arnold, Dorian ; Bridges, Patrick G. ; Ferreira, Kurt B. ; Brightwell, Ron
  • Author_Institution
    Dept. of Comput. Sci., Univ. of New Mexico, Albuquerque, NM, USA
  • fYear
    2012
  • fDate
    10-13 Sept. 2012
  • Firstpage
    148
  • Lastpage
    157
  • Abstract
    The increasing size and complexity of high-performance computing (HPC) systems have led to major concerns over fault frequencies and the mechanisms necessary to tolerate these faults. Previous studies have shown that state-of-the-field checkpoint/restart mechanisms will not scale sufficiently for future-generation systems. Therefore, optimizations that reduce checkpoint overheads are necessary to keep checkpoint/restart mechanisms effective. In this work, we demonstrate that checkpoint data compression is a feasible mechanism for reducing checkpoint commit latencies and storage overheads. Leveraging a simple model for checkpoint compression viability, we show that: (1) checkpoint data compression is feasible for many types of scientific applications expected to run on extreme-scale systems, (2) checkpoint compression viability scales with checkpoint size, (3) the choice between user-level and system-level checkpoints has little impact on checkpoint compression viability, and (4) checkpoint compression viability scales with application process count. Lastly, we describe the impact that checkpoint compression might have on future-generation extreme-scale systems.
  • Keywords
    checkpointing; data compression; software fault tolerance; HPC system; checkpoint commit latency; checkpoint compression viability scale; checkpoint data compression; checkpoint overhead; checkpoint size; checkpoint/restart mechanism; checkpoint/restart-based fault tolerance; fault frequency; high performance computing; scientific application; storage overhead; system-level checkpoints; user-level checkpoints; Benchmark testing; Checkpointing; Compression algorithms; Data compression; Fault tolerance; Libraries; Mathematical model; Checkpoint Compression
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel Processing (ICPP), 2012 41st International Conference on
  • Conference_Location
    Pittsburgh, PA
  • ISSN
    0190-3918
  • Print_ISBN
    978-1-4673-2508-0
  • Type
    conf
  • DOI
    10.1109/ICPP.2012.45
  • Filename
    6337576