Title :
Transparent Checkpoint-Restart of Distributed Applications on Commodity Clusters
Author :
Laadan, Oren ; Phung, Dan ; Nieh, Jason
Author_Institution :
Dept. of Comput. Sci., Columbia Univ., New York, NY
Abstract :
We have created ZapC, a novel system for transparent coordinated checkpoint-restart of distributed network applications on commodity clusters. ZapC provides a thin virtualization layer on top of the operating system that decouples a distributed application from dependencies on the cluster nodes on which it is executing. This decoupling enables ZapC to checkpoint an entire distributed application across all nodes in a coordinated manner such that it can be restarted from the checkpoint on a different set of cluster nodes at a later time. ZapC checkpoint-restart operations execute in parallel across different cluster nodes, providing faster checkpoint-restart performance. ZapC uniquely supports network state in a transport-protocol-independent manner, including correctly saving and restoring socket and protocol state for both TCP and UDP connections. We have implemented a ZapC Linux prototype and demonstrate that it provides low virtualization overhead and fast checkpoint-restart times for distributed network applications without any application, library, kernel, or network protocol modifications.
Keywords :
Linux; checkpointing; distributed processing; ZapC Linux prototype; ZapC checkpoint-restart operations; cluster nodes; commodity clusters; distributed application; distributed applications; distributed network applications; operating system; transparent checkpoint-restart; transparent coordinated checkpoint-restart; transport protocol; Application software; Application virtualization; Checkpointing; Computer science; Kernel; Libraries; Linux; Operating systems; Sockets; Transport protocols;
Conference_Titel :
Cluster Computing, 2005. IEEE International
Conference_Location :
Burlington, MA
Print_ISBN :
0-7803-9486-0
Electronic_ISBN :
1552-5244
DOI :
10.1109/CLUSTR.2005.347039