• DocumentCode
    2933433
  • Title

    An evaluation of difference and threshold techniques for efficient checkpoints

  • Author

    Hogan, Sean ; Hammond, Jeff R. ; Chien, Andrew A.

  • fYear
    2012
  • fDate
    25-28 June 2012
  • Firstpage
    1
  • Lastpage
    6
  • Abstract
    To ensure reliability, long-running and large-scale computations have long used checkpoint-and-restart techniques to preserve computational progress in case of soft or hard failures. These techniques can incur significant overhead, consuming as much as 15% of an application's resources on the US DOE's leadership-class systems, and these overheads are projected to grow in exascale systems, which are likely to have lower IO-to-compute ratios and higher failure rates. We explore the use of differenced checkpoint and cutoff techniques to increase the effectiveness of Lempel-Ziv (gzip) compression, and thereby reduce the size of checkpoints. We apply these techniques to several types of scientific checkpoint data from NWChem, a widely used computational chemistry code. Our results show that while standard compression techniques (and even those customized for floating point data) yield modest compression ratios (≈1.2), differenced checkpoints are dramatically more successful, improving compression ratios by 50%, to a range of 1.55 to 3.15, for a variety of checkpoint data. If cutoffs are incorporated in the differenced checkpoints, these compression ratios can be increased further, with a cutoff of 10⁻⁷ yielding compression ratios greater than 100. These results suggest that further exploration of these approaches is a promising way to reduce checkpoint (and resilience) overhead.
  • Keywords
    checkpointing; data compression; fault tolerant computing; Lempel-Ziv data compression; NWChem; US DOE leadership-class systems; checkpoint size reduction; checkpoint-and-restart technique; compression ratio; computational chemistry code; cutoff technique; differenced checkpoint; floating point data; gzip software; scientific checkpoint data; Arrays; Chemistry; Compression algorithms; Computational modeling; Resilience; Throughput; checkpoint; fault-tolerance; fpc; lempel-ziv
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Dependable Systems and Networks Workshops (DSN-W), 2012 IEEE/IFIP 42nd International Conference on
  • Conference_Location
    Boston, MA
  • Print_ISBN
    978-1-4673-2264-5
  • Electronic_ISBN
    978-1-4673-2265-2
  • Type

    conf

  • DOI
    10.1109/DSNW.2012.6264674
  • Filename
    6264674
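
The differencing and cutoff idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`pack`, `gzip_ratio`, `difference`, `restore`), the synthetic Gaussian data, and the use of Python's `zlib` (DEFLATE, the same algorithm family as gzip) are all assumptions made for illustration, and the ratios it produces say nothing about NWChem data.

```python
import random
import struct
import zlib


def pack(values):
    """Serialize a list of floats as little-endian 64-bit doubles."""
    return struct.pack(f"<{len(values)}d", *values)


def gzip_ratio(values):
    """Compression ratio: raw size divided by DEFLATE-compressed size."""
    raw = pack(values)
    return len(raw) / len(zlib.compress(raw, 6))


def difference(prev, curr, cutoff=0.0):
    """Element-wise difference between two checkpoints.

    Differences with magnitude below `cutoff` are replaced by zero,
    which makes the result lossy but far more compressible.
    """
    return [0.0 if abs(c - p) < cutoff else c - p for p, c in zip(prev, curr)]


def restore(prev, diff):
    """Rebuild the current checkpoint from the previous one plus the diff."""
    return [p + d for p, d in zip(prev, diff)]


# Hypothetical data: two checkpoints that differ by tiny perturbations.
random.seed(0)
prev = [random.gauss(0.0, 1.0) for _ in range(20000)]
curr = [p + 1e-9 * random.gauss(0.0, 1.0) for p in prev]

plain_ratio = gzip_ratio(curr)                      # compress checkpoint directly
diff_ratio = gzip_ratio(difference(prev, curr))     # compress the difference
cutoff_ratio = gzip_ratio(difference(prev, curr, cutoff=1e-7))

# With a cutoff, restoration is approximate; the error is bounded by the cutoff.
restored = restore(prev, difference(prev, curr, cutoff=1e-7))
max_error = max(abs(r - c) for r, c in zip(restored, curr))

print(f"plain: {plain_ratio:.2f}  diff: {diff_ratio:.2f}  cutoff: {cutoff_ratio:.2f}")
print(f"max restore error with cutoff: {max_error:.2e}")
```

In this toy setup every change falls below the cutoff, so the thresholded difference is all zeros and compresses extremely well; real checkpoint data would contain a mix of large and small changes, and the achievable ratio would depend on that mix.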