Title :
FREM: A Fast Restart Mechanism for General Checkpoint/Restart
Author :
Li, Yawei ; Lan, Zhiling
Author_Institution :
Google Inc., Mountain View, CA, USA
fDate :
5/1/2011 12:00:00 AM
Abstract :
As failure rates keep increasing in large systems, the applications running atop them restart more frequently than ever. Existing research on checkpoint/restart mainly focuses on optimizing the checkpoint operation and pays little attention to the restart operation. As a result, application restart latency may be substantial, which greatly threatens system dependability and performance. To attack the restart latency problem, we present FREM, a fast restart mechanism for general checkpoint/restart protocols. By dynamically tracking process data accesses after each checkpoint, FREM masks restart latency by overlapping application recovery with the retrieval of its checkpoint image. We have implemented FREM as a prototype system and tested it under Linux environments. Extensive experiments with real applications demonstrate that it can reduce restart latency by over 50 percent on average compared to conventional restart mechanisms.
Keywords :
Linux; checkpointing; fault tolerant computing; software fault tolerance; application restart latency; checkpoint operation; checkpoint-restart protocols; failure rate; fast restart mechanism; fault tolerance technique; restart operation; system dependability; system performance; Fast restart; fault tolerance; high performance computing; operating system
Journal_Title :
Computers, IEEE Transactions on
DOI :
10.1109/TC.2010.129