• DocumentCode
    1692850
  • Title
    Improving Fault Tolerance in Desktop Grids Based On Incremental Checkpointing
  • Author
    El-Desoky, Ali E. ; Ali, Hisham A. ; Azab, Abdulrahman A.
  • Author_Institution
    Dept. of Comput. Eng., Mansoura Univ.
  • fYear
    2006
  • Firstpage
    386
  • Lastpage
    392
  • Abstract
    Fault tolerance is essential for reliable task execution in computational desktop grid environments, where execution failures are frequent. Periodic checkpointing of running tasks is a common strategy for achieving acceptable fault tolerance. A problem often arises, however: for some long-running tasks, the temporary data stored in a checkpoint file may be too large to transmit reliably between nodes without consuming excessive network bandwidth. Data loss may also occur when such a large amount of data is transmitted in an unreliable communication environment (e.g., a desktop grid). In this paper, a modified application-level incremental checkpointing approach is proposed in which the size of the transmitted checkpoint data can be reduced to about 3% of its original size with little overhead on computation time. The proposed approach also investigates a new mechanism for safely storing a checkpoint file that relies on the availability of the submitting node only. A simulator has been built using the .NET Framework 1.1 to test the validity of the proposed approach, using an application based on matrix multiplication with variable dimensions. Experimental results show that the proposed approach improves fault tolerance while minimizing computational overhead.
  • Keywords
    checkpointing; digital simulation; fault tolerant computing; grid computing; network operating systems; .Net framework 1.1; application level incremental checkpointing; computational desktop grid environment; computer simulator; Automatic logic units; Bandwidth; Checkpointing; Distributed computing; Fault tolerance; Grid computing; Internet; Network servers; Peer to peer computing; Reliability engineering
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Engineering and Systems, The 2006 International Conference on
  • Conference_Location
    Cairo
  • Print_ISBN
    1-4244-0271-9
  • Electronic_ISBN
    1-4244-0272-7
  • Type
    conf
  • DOI
    10.1109/ICCES.2006.320479
  • Filename
    4115539
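
The general idea of application-level incremental checkpointing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (their simulator was built on the .NET Framework 1.1); it only assumes that a matrix product is computed row by row, that each checkpoint carries just the rows finished since the previous checkpoint, and that the submitting node merges those increments. The names `multiply_with_checkpoints`, `SubmitterStore`, `interval`, and `send` are hypothetical and chosen for illustration.

```python
import json


def multiply_with_checkpoints(a, b, interval, send, start_row=0):
    """Compute a x b row by row, sending only newly finished rows.

    Hypothetical sketch: every `interval` rows (and after the last row),
    the rows computed since the previous checkpoint are serialized and
    handed to `send`, instead of re-sending the whole partial result.
    """
    cols = list(zip(*b))  # columns of b, for dot products
    pending = {}          # rows finished since the last checkpoint
    for i in range(start_row, len(a)):
        pending[i] = [sum(x * y for x, y in zip(a[i], col)) for col in cols]
        if len(pending) == interval or i == len(a) - 1:
            send(json.dumps(pending))  # incremental checkpoint
            pending = {}


class SubmitterStore:
    """Assumed stand-in for the submitting node that keeps the checkpoint."""

    def __init__(self):
        self.rows = {}
        self.bytes_received = 0

    def receive(self, payload):
        # Merge one increment into the stored checkpoint.
        self.bytes_received += len(payload)
        self.rows.update({int(k): v for k, v in json.loads(payload).items()})

    def next_row(self):
        # First row index that has not been checkpointed yet.
        n = 0
        while n in self.rows:
            n += 1
        return n

    def result(self, n_rows):
        return [self.rows[i] for i in range(n_rows)]
```

After a worker failure, a replacement worker in this sketch would resume by calling `multiply_with_checkpoints(a, b, interval, store.receive, start_row=store.next_row())`, so that only rows missing from the submitting node's copy are recomputed.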