DocumentCode :
541784
Title :
A hierarchical fault detection and recovery in a computational grid using watchdog timers
Author :
Bhagyashree, A.H. ; Pradeep, Deepthi ; Jayanthy, N. ; Mounica, K.V. ; Nivejaa, S. ; Saranya Dharani, P.
Author_Institution :
Inf. Technol., Amrita Vishwa Vidyapeetham, Coimbatore, India
fYear :
2010
fDate :
27-29 Dec. 2010
Firstpage :
467
Lastpage :
471
Abstract :
Grid computing applies the resources of many networked computers to a single problem or task at the same time. A drawback is that the computers performing the calculations may not always be trustworthy and may fail from time to time. The larger the number of nodes in the grid, the greater the probability that some node fails. Executing workflows reliably therefore requires fault tolerance and recovery strategies. This paper proposes a method that records an instantaneous snapshot of the local state of the processes within each node. An efficient algorithm is introduced for detecting node failures using watchdog timers. For recovery, a divide and conquer algorithm is used that avoids redoing already completed jobs, enabling faster recovery.
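The abstract does not give the detection algorithm itself, so the following is only a minimal sketch of the general watchdog-timer idea it refers to, not the authors' method. The class name `WatchdogMonitor`, its methods, and the timeout value are hypothetical. Each node is assumed to reset ("kick") its timer periodically, and a node whose timer has not been reset within the timeout is reported as failed. Time is passed in explicitly so the behavior is easy to follow.

```python
class WatchdogMonitor:
    """Hypothetical monitor: one watchdog timer per node.

    A node that has not reset ("kicked") its timer within `timeout`
    seconds is treated as failed. This illustrates the general
    technique only; the paper's hierarchical scheme is not shown.
    """

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_kick = {}  # node id -> time of the most recent kick

    def kick(self, node, now):
        # Called whenever a heartbeat from `node` arrives at time `now`.
        self.last_kick[node] = now

    def failed_nodes(self, now):
        # Nodes whose timers have expired at time `now`, in sorted order.
        return sorted(
            node
            for node, last in self.last_kick.items()
            if now - last > self.timeout
        )


monitor = WatchdogMonitor(timeout=5.0)
monitor.kick("node-a", now=0.0)
monitor.kick("node-b", now=0.0)
monitor.kick("node-a", now=4.0)        # node-a keeps reporting, node-b goes silent
print(monitor.failed_nodes(now=6.0))   # only node-b has exceeded the timeout
```

In a running system the `now` argument would typically come from a monotonic clock such as Python's `time.monotonic()`, so that adjustments to the wall clock do not trigger false failure reports.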
Keywords :
divide and conquer methods; grid computing; software fault tolerance; system recovery; computational grid; divide and conquer algorithm; fault recovery strategy; fault tolerance; grid computing; hierarchical fault detection; node failure detection; watchdog timer; Clustering algorithms; Fault detection; Fault tolerance; Fault tolerant systems; Load management; Optimal scheduling; Radiation detectors; Grid Computing; cluster; fault; load balancing; watch dog timer;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Communication and Computational Intelligence (INCOCCI), 2010 International Conference on
Conference_Location :
Erode
Type :
conf
Filename :
5738775