DocumentCode
1854074
Title
Checkpointing Process Groups in a Grid Environment
Author
Mehnert-Spahn, John ; Schottner, Michael ; Morin, Christine
Author_Institution
Dept. of Comput. Sci., Heinrich-Heine Univ., Duesseldorf
fYear
2008
fDate
1-4 Dec. 2008
Firstpage
243
Lastpage
251
Abstract
The EU-funded XtreemOS project implements a grid operating system that transparently exploits the resources of virtual organizations through the standard POSIX interface. Grid checkpointing and restart require saving and restoring jobs that execute in a distributed, heterogeneous grid environment. Such an environment may span millions of grid nodes (PCs, clusters, and mobile devices), each using a different system-specific checkpointer that saves and restores application and kernel data structures for the processes executing on that node. In this paper we briefly describe the XtreemOS grid checkpointing architecture and how we bridge the gap between the abstract grid checkpointer and the system-specific checkpointers. We then discuss how we keep track of processes and how different process grouping techniques are managed to ensure that all processes of a job, and any further dependent ones, can be checkpointed and restarted. Finally, we present how Linux control groups can be used to address resource isolation issues during restart.
Keywords
checkpointing; data structures; grid computing; software architecture; Linux control groups; POSIX interface; XtreemOS grid checkpointing architecture; checkpointing process; distributed heterogeneous grid environment; kernel data structures; resource isolation; virtual organizations; Checkpointing; Computer science; Kernel; Linux; Middleware; Operating systems; Personal communication networks; Power system management; Power system security; Resource management; fault tolerance
fLanguage
English
Publisher
IEEE
Conference_Titel
Parallel and Distributed Computing, Applications and Technologies, 2008. PDCAT 2008. Ninth International Conference on
Conference_Location
Otago
Print_ISBN
978-0-7695-3443-5
Type
conf
DOI
10.1109/PDCAT.2008.14
Filename
4710987