DocumentCode :
1069466
Title :
Incorporation of optimal timeouts into distributed real-time load sharing
Author :
Hou, Chao-Ju ; Shin, Kang G.
Author_Institution :
Real-Time Comput. Lab., Michigan Univ., Ann Arbor, MI, USA
Volume :
43
Issue :
5
fYear :
1994
fDate :
5/1/1994
Firstpage :
528
Lastpage :
547
Abstract :
Consideration is given to the problem of designing and incorporating a timeout mechanism into load sharing (LS) with state-region change broadcasts in the presence of node failures in a distributed real-time system. Failure of a node is diagnosed by the other nodes through communication timeouts, and the timeout period used to diagnose whether a node is faulty usually depends on the dynamic changes in system load, the task attributes at the node, and the state the node was initially in. We formulate the problem of determining the “best” timeout period Tout(i) for node i as a hypothesis testing problem, and maximize the probability of detecting node failures subject to a pre-specified probability of falsely diagnosing a healthy node as faulty. The parameters needed for the calculation of Tout(i) are estimated online by node i using the Bayesian technique and are piggy-backed in its region-change broadcasts. The broadcast information is then used to determine Tout(i). If node n has not heard from node i for a period Tout(i) since its receipt of the latest broadcast from node i, it will consider node i failed, and will not consider any task transfer to node i until it receives a broadcast message from node i again. In addition, to further reduce the probability of incorrect diagnosis, each node n also determines its own timeout period Tout(n), and broadcasts its state not only at the time of state-region changes but also when it has remained within a broadcast interval throughout Tout(n).
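
The timeout rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' model: it assumes, purely for illustration, that a healthy node's inter-broadcast gaps are exponentially distributed with an unknown rate, that the rate is estimated with a conjugate Gamma prior, and that the timeout is the smallest silence period whose false-alarm probability stays at or below a chosen bound alpha. Under that assumed model, the shortest admissible timeout also gives the fastest failure detection. The names `GammaRateEstimator`, `timeout_for_false_alarm`, and `PeerMonitor` are hypothetical.

```python
import math
from dataclasses import dataclass


def timeout_for_false_alarm(rate: float, alpha: float) -> float:
    """Smallest T such that P(gap > T | node healthy) <= alpha.

    Assumes exponential inter-broadcast gaps (an illustrative assumption,
    not the paper's model), so P(gap > T) = exp(-rate * T).
    """
    if rate <= 0 or not 0 < alpha < 1:
        raise ValueError("rate must be > 0 and alpha must lie in (0, 1)")
    return -math.log(alpha) / rate


@dataclass
class GammaRateEstimator:
    """Bayesian estimate of the broadcast rate with a Gamma(shape, total_time) prior."""
    shape: float = 1.0       # prior pseudo-count of observed broadcasts
    total_time: float = 1.0  # prior pseudo-duration of observation

    def observe(self, gap: float) -> None:
        # Conjugate update for one exponentially distributed gap.
        self.shape += 1.0
        self.total_time += gap

    def mean_rate(self) -> float:
        # Posterior mean of the rate.
        return self.shape / self.total_time


class PeerMonitor:
    """Run by node n: records the latest broadcast and timeout of each peer."""

    def __init__(self) -> None:
        self._last = {}  # peer id -> (time of latest broadcast, piggy-backed timeout)

    def on_broadcast(self, peer: str, now: float, timeout: float) -> None:
        self._last[peer] = (now, timeout)

    def is_suspected(self, peer: str, now: float) -> bool:
        # A peer that has never been heard from is treated as unavailable.
        if peer not in self._last:
            return True
        last_heard, timeout = self._last[peer]
        return now - last_heard > timeout


# Node i estimates its own rate and derives the timeout it piggy-backs.
estimator = GammaRateEstimator()
for gap in (2.0, 2.0):
    estimator.observe(gap)
tout_i = timeout_for_false_alarm(estimator.mean_rate(), alpha=0.05)

# Node n uses the piggy-backed timeout to decide whether to transfer tasks to i.
monitor = PeerMonitor()
monitor.on_broadcast("i", now=10.0, timeout=tout_i)
```

In this sketch, node n would skip node i as a transfer target whenever `monitor.is_suspected("i", now)` is true, and would resume considering it after the next call to `on_broadcast`.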
Keywords :
Bayes methods; fault tolerant computing; multiprocessing systems; parallel algorithms; real-time systems; resource allocation; scheduling; software reliability; Bayesian technique; broadcast information; broadcast interval; broadcast message; communication timeouts; distributed real-time load sharing; distributed real-time system; hypothesis testing problem; node failures; optimal timeouts; region-change broadcasts; state-region change broadcasts; state-region changes; system load; task attributes; task transfer; timeout mechanism; Availability; Bayesian methods; Broadcasting; Chaotic communication; Distributed computing; Fault detection; Helium; Parameter estimation; Real time systems; System testing;
fLanguage :
English
Journal_Title :
Computers, IEEE Transactions on
Publisher :
IEEE
ISSN :
0018-9340
Type :
jour
DOI :
10.1109/12.280801
Filename :
280801