A Framework for Executing Long Running Jobs in Grid Environments

Author

Markatchev, Nayden ; Kiddle, Cameron ; Simmonds, Rob

Author_Institution

Dept. of Comput. Sci., Univ. of Calgary, Calgary, AB

fYear

2008

fDate

9-11 June 2008

Firstpage

69

Lastpage

75

Abstract

Computational jobs that take days, weeks or months to run usually cannot be executed as a single job due to system failures and scheduling constraints. Instead, the job must be split into a series of shorter jobs. Solutions for managing the execution of such jobs in grid environments must address many issues. Participating systems and their properties can change over time, so it is important to have dynamic resource discovery mechanisms. Data management tools are needed to manage and keep track of data that can be distributed across multiple sites. Fault tolerance is required to handle the many different errors and failures that can occur in such environments. Furthermore, support for job reconfiguration, in terms of the number of processors, run length, and memory required, is necessary to allow jobs to adapt to the heterogeneous resources they are submitted to. This paper presents a framework for executing long running jobs in grid environments that addresses the above issues. The framework automates checkpointing, migration and reconfiguration of jobs. It has been successfully tested with the GROMACS molecular dynamics simulation application in a GT4-based grid environment composed of resources distributed across Canada.

Keywords

fault tolerant computing; formal verification; grid computing; molecular dynamics method; resource allocation; scheduling; GROMACS molecular dynamics simulation; GT4-based grid environment; checkpointing; computational jobs; data management tools; dynamic resource discovery mechanisms; fault tolerance; grid environments; scheduling constraints; system failures; Application software; Checkpointing; Computer science; Environmental management; Fault tolerance; Grid computing; High performance computing; Mechanical factors; Processor scheduling; Testing; Adaptive Scheduling; Execution Framework; Grid Computing

fLanguage

English

Publisher

IEEE

Conference_Titel

High Performance Computing Systems and Applications, 2008. HPCS 2008. 22nd International Symposium on

Conference_Location

Quebec City, Que.

ISSN

1550-5243

Print_ISBN

978-0-7695-3250-9

Type

conf

DOI

10.1109/HPCS.2008.7

Filename

4556075