DocumentCode
2933433
Title
An evaluation of difference and threshold techniques for efficient checkpoints
Author
Hogan, Sean ; Hammond, Jeff R. ; Chien, Andrew A.
fYear
2012
fDate
25-28 June 2012
Firstpage
1
Lastpage
6
Abstract
To ensure reliability, long-running and large-scale computations have long used checkpoint-and-restart techniques to preserve computational progress in case of soft or hard failures. These techniques can incur significant overhead, consuming as much as 15% of an application's resources on the US DOE's leadership-class systems, and these overheads are projected to grow in exascale systems, which are likely to have lower I/O-to-compute ratios and higher failure rates. We explore the use of differenced checkpoint and cutoff techniques to increase the effectiveness of Lempel-Ziv compression (gzip), and thereby reduce the size of checkpoints. We apply these techniques to several types of scientific checkpoint data from NWChem, a widely used computational chemistry code. Our results show that while standard compression techniques (and even those customized for floating-point data) yield modest compression ratios (≈1.2), differenced checkpoints are dramatically more successful, improving compression ratios by 50%, to between 1.55 and 3.15, for a variety of checkpoint data. If cutoffs are incorporated into the differenced checkpoints, these compression ratios can be increased further, with a cutoff of 10⁻⁷ yielding compression ratios greater than 100. These results suggest that further exploration of these approaches is a promising way to reduce checkpoint (and resilience) overhead.
Keywords
checkpointing; data compression; fault tolerant computing; Lempel-Ziv data compression; NWChem; US DOE leadership-class systems; checkpoint size reduction; checkpoint-and-restart technique; compression ratio; computational chemistry code; cutoff technique; differenced checkpoint; floating point data; gzip software; scientific checkpoint data; Arrays; Checkpointing; Chemistry; Compression algorithms; Computational modeling; Resilience; Throughput; NWChem; checkpoint; data compression; fault-tolerance; fpc; lempel-ziv; resilience;
fLanguage
English
Publisher
IEEE
Conference_Titel
Dependable Systems and Networks Workshops (DSN-W), 2012 IEEE/IFIP 42nd International Conference on
Conference_Location
Boston, MA
Print_ISBN
978-1-4673-2264-5
Electronic_ISBN
978-1-4673-2265-2
Type
conf
DOI
10.1109/DSNW.2012.6264674
Filename
6264674