• DocumentCode
    584807
  • Title
    Application checkpointing in grid environment with improved checkpoint reliability through replication
  • Author
    Bawa, Rajesh Kumar ; Singh, Rajdeep
  • Author_Institution
    Dept. of Comput. Sci., Punjabi Univ., Patiala, India
  • fYear
    2012
  • fDate
    26-28 July 2012
  • Firstpage
    1
  • Lastpage
    6
  • Abstract
    Grid technologies are emerging as the next generation of distributed computing, allowing the aggregation of heterogeneous, geographically distributed resources. The heterogeneous nature of the grid makes it more vulnerable to faults, which lead either to job failure or to delays in completing job execution. Checkpointing is one of many fault tolerance techniques used to make the grid more efficient and reliable. In this paper we develop an application-checkpointing-based fault tolerance technique for an Alchemi-based grid environment. In this technique, application threads generate their checkpoints and store them in the checkpoint table at the manager node. If a thread fails, the checkpoint of the corresponding thread is used to resume execution from the point of failure. The technique introduces a slight overhead in fault-free situations but is very effective in case of a node failure. Increasing the checkpoint frequency improves the job's resuming capability but also increases the overhead of generating and storing checkpoints, which increases the job's processing time.
  • Keywords
    checkpointing; fault tolerant computing; grid computing; reliability; resource allocation; scheduling; Alchemi based grid environment; application checkpointing based fault tolerance technique; application threads; fault free situations; geographically distributed heterogeneous resources; grid environment; grid technologies; improved checkpoint reliability; job execution delay; job failure; job resuming capability; next generation distributed computing; replication; Message systems; Reliability engineering; Time frequency analysis; Fault Tolerance; Job Scheduling; QoS (Quality of Service); Resource Management;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computing Communication & Networking Technologies (ICCCNT), 2012 Third International Conference on
  • Conference_Location
    Coimbatore
  • Type
    conf
  • DOI
    10.1109/ICCCNT.2012.6395974
  • Filename
    6395974
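
The checkpoint-and-resume scheme described in the abstract can be illustrated with a minimal sketch. This is not the paper's Alchemi implementation. The names `CheckpointTable`, `run_thread`, `drop_replica`, and the `interval` and `fail_at` parameters are hypothetical, and the replication shown (identical in-memory copies of the table) is only an assumption about what "replication" might mean in the title. A thread saves its progress to a manager-side table every `interval` steps, and after a simulated failure it resumes from the last saved checkpoint.

```python
from typing import Dict, List, Optional, Tuple

Checkpoint = Tuple[int, int]  # (next step to run, partial result so far)


class CheckpointTable:
    """Hypothetical manager-side table: thread id -> checkpoint, kept in several replicas."""

    def __init__(self, replicas: int = 2) -> None:
        self._stores: List[Dict[int, Checkpoint]] = [dict() for _ in range(replicas)]

    def save(self, thread_id: int, checkpoint: Checkpoint) -> None:
        # Write the same checkpoint to every replica.
        for store in self._stores:
            store[thread_id] = checkpoint

    def load(self, thread_id: int) -> Checkpoint:
        # Return the first surviving copy, or a fresh start if none exists.
        for store in self._stores:
            if thread_id in store:
                return store[thread_id]
        return (0, 0)

    def drop_replica(self, index: int) -> None:
        # Simulate the loss of one replica of the table.
        self._stores[index].clear()


def run_thread(table: CheckpointTable, thread_id: int, total_steps: int,
               interval: int, fail_at: Optional[int] = None) -> int:
    """Sum the integers 0..total_steps-1, checkpointing every `interval` steps."""
    step, acc = table.load(thread_id)  # resume from the last checkpoint, if any
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node failure")
        acc += step
        step += 1
        if step % interval == 0:
            table.save(thread_id, (step, acc))
    return acc


if __name__ == "__main__":
    table = CheckpointTable(replicas=2)
    try:
        run_thread(table, thread_id=1, total_steps=10, interval=3, fail_at=7)
    except RuntimeError:
        pass  # the thread died; its last checkpoint remains at the manager
    table.drop_replica(0)  # one copy of the table is lost as well
    print(run_thread(table, thread_id=1, total_steps=10, interval=3))
```

A smaller `interval` means less work is repeated after a failure but more calls to `save`, which mirrors the trade-off between resuming capability and checkpoint overhead noted in the abstract.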