DocumentCode :
2933433
Title :
An evaluation of difference and threshold techniques for efficient checkpoints
Author :
Hogan, Sean ; Hammond, Jeff R. ; Chien, Andrew A.
fYear :
2012
fDate :
25-28 June 2012
Firstpage :
1
Lastpage :
6
Abstract :
To ensure reliability, long-running and large-scale computations have long used checkpoint-and-restart techniques to preserve computational progress in case of soft or hard failures. These techniques can incur significant overhead, consuming as much as 15% of an application's resources on the US DOE's leadership-class systems, and these overheads are projected to grow in exascale systems, which are likely to have lower I/O-to-compute ratios and higher failure rates. We explore the use of differenced checkpoint and cutoff techniques to increase the effectiveness of Lempel-Ziv (gzip) compression, and thereby reduce the size of checkpoints. We apply these techniques to several types of scientific checkpoint data from NWChem, a widely used computational chemistry code. Our results show that while standard compression techniques (and even those customized for floating-point data) yield modest compression ratios (≈1.2), differenced checkpoints are dramatically more successful, improving compression ratios by 50%, to between 1.55 and 3.15, for a variety of checkpoint data. If cutoffs are incorporated in the differenced checkpoints, these compression ratios can be increased further, with a cutoff of 10⁻⁷ yielding a dramatic improvement, to compression ratios greater than 100. These results suggest that further exploration of these approaches is a promising way to reduce checkpoint (and resilience) overhead.
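The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the checkpoint data here is synthetic rather than taken from NWChem, the helper names (`pack`, `compression_ratio`, `differenced`, `restore`) are hypothetical, and treating a "cutoff" as zeroing every difference whose magnitude is at or below the threshold is an assumption about how the technique works.

```python
import gzip
import random
import struct


def pack(values):
    """Serialize a list of floats as little-endian IEEE 754 doubles."""
    return struct.pack(f"<{len(values)}d", *values)


def compression_ratio(values):
    """Uncompressed size divided by gzip-compressed size."""
    raw = pack(values)
    return len(raw) / len(gzip.compress(raw))


def differenced(prev, curr, cutoff=0.0):
    """Element-wise curr - prev; differences with |d| <= cutoff are set to 0.0.

    Assumption: this is one plausible reading of the "cutoff" technique.
    """
    return [d if abs(d) > cutoff else 0.0
            for d in (c - p for p, c in zip(prev, curr))]


def restore(prev, diff):
    """Rebuild the newer checkpoint (approximately, if a cutoff was used)."""
    return [p + d for p, d in zip(prev, diff)]


# Synthetic stand-in for two consecutive checkpoints: most values barely
# change between checkpoints, and one in ten changes noticeably.
rng = random.Random(0)
prev = [rng.uniform(-1.0, 1.0) for _ in range(20000)]
curr = [p + rng.gauss(0.0, 1e-3 if i % 10 == 0 else 1e-8)
        for i, p in enumerate(prev)]

ratio_plain = compression_ratio(curr)                    # gzip on raw data
ratio_diff = compression_ratio(differenced(prev, curr))  # gzip on differences
diff_cut = differenced(prev, curr, cutoff=1e-7)
ratio_cut = compression_ratio(diff_cut)                  # differences + cutoff

# The cutoff makes the checkpoint lossy; measure the worst-case error.
max_err = max(abs(a - b) for a, b in zip(restore(prev, diff_cut), curr))

print(f"plain: {ratio_plain:.2f}  diff: {ratio_diff:.2f}  "
      f"diff+cutoff: {ratio_cut:.2f}  max error: {max_err:.1e}")
```

The ratios printed for this synthetic data are not comparable to the figures reported in the abstract. The sketch only shows the mechanism: zeroing small differences produces long runs of identical bytes that a Lempel-Ziv compressor handles well, at the cost of a restore error bounded by the cutoff.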
Keywords :
checkpointing; data compression; fault tolerant computing; Lempel-Ziv data compression; NWChem; US DOE leadership-class systems; checkpoint size reduction; checkpoint-and-restart technique; compression ratio; computational chemistry code; cutoff technique; differenced checkpoint; floating point data; gzip software; scientific checkpoint data; Arrays; Chemistry; Compression algorithms; Computational modeling; Resilience; Throughput; checkpoint; fault-tolerance; FPC; Lempel-Ziv
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Dependable Systems and Networks Workshops (DSN-W), 2012 IEEE/IFIP 42nd International Conference on
Conference_Location :
Boston, MA
Print_ISBN :
978-1-4673-2264-5
Electronic_ISBN :
978-1-4673-2265-2
Type :
conf
DOI :
10.1109/DSNW.2012.6264674
Filename :
6264674