DocumentCode
692905
Title
Optimization of cloud task processing with checkpoint-restart mechanism
Author
Sheng Di ; Robert, Yannick ; Vivien, F. ; Kondo, Daishi ; Cho-Li Wang ; Cappello, Franck
Author_Institution
Argonne Nat. Lab., Argonne, IL, USA
fYear
2013
fDate
17-22 Nov. 2013
Firstpage
1
Lastpage
12
Abstract
In this paper, we aim at optimizing fault-tolerance techniques based on a checkpointing/restart mechanism, in the context of cloud computing. Our contribution is three-fold. (1) We derive a fresh formula to compute the optimal number of checkpoints for cloud jobs with varied distributions of failure events. Our analysis is not only generic with no assumption on failure probability distribution, but also attractively simple to apply in practice. (2) We design an adaptive algorithm to optimize the impact of checkpointing regarding various costs like checkpointing/restart overhead. (3) We evaluate our optimized solution in a real cluster environment with hundreds of virtual machines and Berkeley Lab Checkpoint/Restart tool. Task failure events are emulated via a production trace produced on a large-scale Google data center. Experiments confirm that our solution is fairly suitable for Google systems. Our optimized formula outperforms Young's formula by 3-10 percent, reducing wall-clock lengths by 50-100 seconds per job on average.
Keywords
cloud computing; failure analysis; software fault tolerance; software reliability; statistical distributions; virtual machines; Berkeley lab checkpoint-restart tool; adaptive algorithm; checkpoint-restart mechanism; cloud computing; cloud task processing optimization; failure event distributions; failure probability distribution; large-scale Google data center; optimizing fault-tolerance techniques; production trace; real cluster environment; task failure events; virtual machines; Checkpointing; Cloud computing; Clouds; Fault tolerance; Fault tolerant systems; Google; Probability distribution; BLCR; Checkpoint-Restart Mechanism; Cloud Computing; Google; Optimal Checkpointing Interval;
fLanguage
English
Publisher
IEEE
Conference_Titel
High Performance Computing, Networking, Storage and Analysis (SC), 2013 International Conference for
Conference_Location
Denver, CO
Print_ISBN
978-1-4503-2378-9
Type
conf
DOI
10.1145/2503210.2503217
Filename
6877497