  • DocumentCode
    1998762
  • Title
    A Framework for Executing Long Running Jobs in Grid Environments
  • Author
    Markatchev, Nayden; Kiddle, Cameron; Simmonds, Rob
  • Author_Institution
    Department of Computer Science, University of Calgary, Calgary, AB
  • fYear
    2008
  • fDate
    9-11 June 2008
  • Firstpage
    69
  • Lastpage
    75
  • Abstract
    Computational jobs that take days, weeks, or months to run usually cannot be executed as a single job because of system failures and scheduling constraints. Instead, the job must be split into a series of shorter jobs. Solutions for managing the execution of such jobs in grid environments must address many issues. Participating systems and their properties can change over time, so dynamic resource discovery mechanisms are important. Data management tools are needed to manage and keep track of data that can be distributed across multiple sites. Fault tolerance is required to handle the many different errors and failures that can occur in such environments. Furthermore, support for job reconfiguration, in terms of the number of processors, run length, and memory required, is necessary to allow jobs to adapt to the heterogeneous resources to which they are submitted. This paper presents a framework for executing long-running jobs in grid environments that addresses the above issues. The framework automates checkpointing, migration, and reconfiguration of jobs. It has been successfully tested with the GROMACS molecular dynamics simulation application in a GT4-based grid environment composed of resources distributed across Canada.
  • Keywords
    fault tolerant computing; formal verification; grid computing; molecular dynamics method; resource allocation; scheduling; GROMACS molecular dynamics simulation; GT4-based grid environment; checkpointing; computational jobs; data management tools; dynamic resource discovery mechanisms; fault tolerance; grid environments; scheduling constraints; system failures; Application software; Checkpointing; Computer science; Environmental management; Fault tolerance; Grid computing; High performance computing; Mechanical factors; Processor scheduling; Testing; Adaptive Scheduling; Execution Framework; Grid Computing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    22nd International Symposium on High Performance Computing Systems and Applications (HPCS 2008)
  • Conference_Location
    Quebec City, Que.
  • ISSN
    1550-5243
  • Print_ISBN
    978-0-7695-3250-9
  • Type
    conf
  • DOI
    10.1109/HPCS.2008.7
  • Filename
    4556075