DocumentCode
1998762
Title
A Framework for Executing Long Running Jobs in Grid Environments
Author
Markatchev, Nayden ; Kiddle, Cameron ; Simmonds, Rob
Author_Institution
Dept. of Comput. Sci., Univ. of Calgary, Calgary, AB
fYear
2008
fDate
9-11 June 2008
Firstpage
69
Lastpage
75
Abstract
Computational jobs that take days, weeks or months to run usually cannot be executed as a single job due to system failures and scheduling constraints. Instead, the job must be split into a series of shorter jobs. Solutions for managing the execution of such jobs in grid environments must address many issues. Participating systems and their properties can change over time, so dynamic resource discovery mechanisms are important. Data management tools are needed to manage and keep track of data that can be distributed across multiple sites. Fault tolerance is required to handle the many different errors and failures that can occur in such environments. Furthermore, support for job reconfiguration, in terms of the number of processors, run length, and memory required, is necessary to allow jobs to adapt to the heterogeneous resources they are submitted to. This paper presents a framework for executing long-running jobs in grid environments that addresses the above issues. The framework automates checkpointing, migration and reconfiguration of jobs. It has been successfully tested with the GROMACS molecular dynamics simulation application in a GT4-based grid environment composed of resources distributed across Canada.
Keywords
fault tolerant computing; formal verification; grid computing; molecular dynamics method; resource allocation; scheduling; GROMACS molecular dynamics simulation; GT4-based grid environment; checkpointing; computational jobs; data management tools; dynamic resource discovery mechanisms; fault tolerance; grid environments; scheduling constraints; system failures; Application software; Checkpointing; Computer science; Environmental management; Fault tolerance; Grid computing; High performance computing; Mechanical factors; Processor scheduling; Testing; Adaptive Scheduling; Execution Framework; Grid Computing
fLanguage
English
Publisher
IEEE
Conference_Titel
High Performance Computing Systems and Applications, 2008. HPCS 2008. 22nd International Symposium on
Conference_Location
Quebec City, Que.
ISSN
1550-5243
Print_ISBN
978-0-7695-3250-9
Type
conf
DOI
10.1109/HPCS.2008.7
Filename
4556075