Title :
VCCP: A transparent, coordinated checkpointing system for virtualization-based cluster computing
Author :
Ong, Hong ; Saragol, Natthapol ; Chanchio, Kasidit ; Leangsuksun, Chokchai
Author_Institution :
Comput. Sci. & Math. Div., Oak Ridge Nat. Lab., Oak Ridge, TN, USA
Date :
Aug. 31 2009-Sept. 4 2009
Abstract :
A virtual machine, which typically consists of a guest operating system (OS) and its serial applications, can be checkpointed, migrated to another cluster node, and later restarted from its previously saved state. To date, however, it has been nontrivial to provide checkpoint-restart mechanisms with the same level of transparency for distributed applications running on a cluster of virtual machines. To address this issue, we have created the Virtual Cluster CheckPointing (VCCP) system, a novel system for transparent, coordinated checkpoint-restart of virtual machines and their distributed applications on commodity clusters. In this paper, we detail the design and implementation of the VCCP system. Our VCCP prototype extends the open-source QEMU system with the kqemu module by implementing hypervisor-based coordinated checkpoint-restart protocols. To verify and validate our prototype, we measured its performance using the NAS Parallel Benchmarks. Our experimental results indicate that VCCP adds less than 1% execution overhead for non-communication-intensive parallel applications. Furthermore, our correctness analysis shows that VCCP does not cause message loss or reordering, a property necessary for the correctness of a checkpoint-restart mechanism. Finally, we believe that VCCP is a promising checkpoint-restart alternative for legacy applications that have implemented traditional process-level checkpoint-restart.
Keywords :
checkpointing; distributed processing; operating systems (computers); software fault tolerance; virtual machines; checkpoint-restart mechanisms; coordinated checkpointing system; operating system; process-level checkpoint-restart; transparent coordinated checkpoint-restart; virtual cluster checkpointing; virtual machine cluster; virtualization-based cluster computing; Application software; Application virtualization; Checkpointing; Computer science; Fault tolerant systems; High performance computing; Operating systems; Power system reliability; Prototypes; Virtual machining; coordinated checkpointing; fault tolerance; high performance computing; virtualization
Conference_Title :
Cluster Computing and Workshops, 2009. CLUSTER '09. IEEE International Conference on
Conference_Location :
New Orleans, LA
Print_ISBN :
978-1-4244-5011-4
Electronic_ISBN :
1552-5244
DOI :
10.1109/CLUSTR.2009.5289183