DocumentCode :
1514518
Title :
FREM: A Fast Restart Mechanism for General Checkpoint/Restart
Author :
Li, Yawei ; Lan, Zhiling
Author_Institution :
Google Inc., Mountain View, CA, USA
Volume :
60
Issue :
5
fYear :
2011
fDate :
5/1/2011 12:00:00 AM
Firstpage :
639
Lastpage :
652
Abstract :
As failure rates keep increasing in large systems, the applications running on them restart more frequently than ever. Existing research on checkpoint/restart mainly focuses on optimizing the checkpoint operation and pays little attention to the restart operation. As a result, application restart latency may be substantial, which greatly threatens system dependability and performance. To attack the restart latency problem, we present FREM, a fast restart mechanism for general checkpoint/restart protocols. By dynamically tracking process data accesses after each checkpoint, FREM masks restart latency by overlapping application recovery with the retrieval of its checkpoint image. We have implemented FREM as a prototype system and tested it under Linux environments. Extensive experiments with real applications demonstrate that it can reduce restart latency by over 50 percent on average compared with conventional restart mechanisms.
Keywords :
Linux; checkpointing; fault tolerant computing; software fault tolerance; application restart latency; checkpoint operation; checkpoint-restart protocols; failure rate; fast restart mechanism; fault tolerance technique; restart operation; system dependability; system performance; fast restart; fault tolerance; high performance computing; operating system
fLanguage :
English
Journal_Title :
Computers, IEEE Transactions on
Publisher :
IEEE
ISSN :
0018-9340
Type :
jour
DOI :
10.1109/TC.2010.129
Filename :
5483295