Title of article :
Optimizing Checkpoint Restart with Data Deduplication
Author/Authors :
Chen, Zhengyu (College of Computer Science and Electronic Engineering, Hunan University, China); Sun, Jianhua (College of Computer Science and Electronic Engineering, Hunan University, China); Chen, Hao (College of Computer Science and Electronic Engineering, Hunan University, China)
Pages :
12
From page :
1
To page :
12
Abstract :
As computer systems grow in size and complexity, hardware and software faults occur more frequently, making fault-tolerance techniques an essential component of high-performance computing systems. Checkpoint restart is a typical and widely used method for tolerating runtime faults. However, the rapidly growing size of the checkpoint files that must be saved to external storage poses a major scalability challenge and calls for efficient approaches to reducing the amount of checkpoint data. In this paper, we first motivate the need for redundancy elimination with a detailed analysis of checkpoint data from real scenarios. Based on this analysis, we apply inline data deduplication to reduce checkpoint size. We use DMTCP, an open-source checkpoint restart package, to validate our method. Our experiments show that the method reduces checkpoint file size by 20% for single-computer programs and by 47% for distributed programs.
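The idea of deduplicating checkpoint data can be illustrated with a minimal sketch. It is not the authors' implementation and does not use DMTCP: the fixed 4096-byte chunk size, the SHA-256 fingerprint, and the names `deduplicate`, `restore`, and `stored_bytes` are all assumptions chosen for illustration. Each chunk of a checkpoint image is fingerprinted as it is written, and only chunks with a new fingerprint are stored.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed chunk size; the abstract does not specify one


def deduplicate(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split data into fixed-size chunks and keep one copy of each unique chunk."""
    store = {}   # fingerprint -> chunk bytes (unique chunks only)
    recipe = []  # ordered fingerprints needed to rebuild the original data
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fingerprint = hashlib.sha256(chunk).digest()
        if fingerprint not in store:
            store[fingerprint] = chunk  # first time this content is seen
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data from the chunk store and the recipe."""
    return b"".join(store[fingerprint] for fingerprint in recipe)


def stored_bytes(store) -> int:
    """Total payload bytes kept after deduplication (excluding metadata)."""
    return sum(len(chunk) for chunk in store.values())


# Hypothetical checkpoint image: three identical zero pages plus one distinct page.
image = bytes(CHUNK_SIZE) * 3 + b"\x01" * CHUNK_SIZE
store, recipe = deduplicate(image)
```

In this toy image the three zero-filled pages collapse into a single stored chunk, so only two of the four chunks are kept, and `restore(store, recipe)` returns the original bytes. Real checkpoint images and real deduplication systems differ in chunking strategy and metadata overhead, so the savings here say nothing about the 20% and 47% figures reported in the abstract.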
Keywords :
Optimizing Checkpoint, Data Deduplication, The increasing scale, hardware, software, DMTCP
Journal title :
Scientific Programming
Serial Year :
2016
Record number :
2607480