Title :
A fast restart mechanism for checkpoint/recovery protocols in networked environments
Author :
Li, Yawei ; Lan, Zhiling
Author_Institution :
Dept. of Comput. Sci., Illinois Inst. of Technol., Chicago, IL
Abstract :
Checkpoint/recovery has been studied extensively, and various optimization techniques have been proposed to improve it. Despite this considerable research effort, little work has been done on reducing its restart latency. The time spent retrieving and loading the checkpoint image during recovery is non-trivial, especially in networked environments, and it is becoming a greater concern as application memory footprints and system failure rates continue to grow. In this paper, we present a fast restart mechanism called FREM. It allows a failed process to restart quickly without requiring the entire checkpoint image to be available. By dynamically tracking the data accesses of the process after each checkpoint, FREM masks restart latency by overlapping the computation of the resumed process with the retrieval of its checkpoint image. We have implemented FREM with the BLCR checkpointing tool on Linux systems. Our experiments with the SPEC benchmarks indicate that it can reduce restart latency by 61.96% on average in networked environments.
Keywords :
checkpointing; protocols; software tools; Linux systems; checkpoint image; checkpoint-recovery protocols; fast restart mechanism; optimization techniques; restart latency; Access protocols; Checkpointing; Computer networks; Delay; Fault tolerant systems; High performance computing; Image retrieval; Information retrieval; Linux; Runtime
Conference_Title :
Dependable Systems and Networks With FTCS and DCC, 2008. DSN 2008. IEEE International Conference on
Conference_Location :
Anchorage, AK
Print_ISBN :
978-1-4244-2397-2
Electronic_ISBN :
978-1-4244-2398-9
DOI :
10.1109/DSN.2008.4630090