DocumentCode
1816896
Title
Distributed Troubleshooting Agents
Author
Earl, Charles ; Remolina, Emilio ; Ong, Jim ; Brown, John
Author_Institution
Stottler Henke Associates Inc., San Mateo, CA
fYear
2005
fDate
13-16 June 2005
Firstpage
365
Lastpage
366
Abstract
Key issues to address in autonomic job recovery for cluster computing are recognizing job failure; understanding the failure sufficiently to know if and how to restart the job; and rapidly integrating this information into the cluster architecture so that the failure is better mitigated in the future. The agent-based high availability (ABHA) system provides an API and a collection of services for building autonomic batch job recovery into cluster and grid computing environments. An agent API allows users to define agents for failure diagnosis and recovery. ABHA is currently being evaluated in the U.S. Department of Energy's STAR project.
Keywords
application program interfaces; fault diagnosis; grid computing; software agents; software fault tolerance; system recovery; workstation clusters; ABHA system; STAR project; US Department of Energy; agent API; agent based high availability system; application program interface; batch job recovery; cluster architecture; cluster computing; distributed troubleshooting agents; failure diagnosis; failure mitigation; failure recovery; grid computing; job failure recognition; job restarting; Application software; Availability; Buildings; Computer architecture; Computer languages; Computer networks; Distributed computing; Grid computing; Java; Job production systems;
fLanguage
English
Publisher
IEEE
Conference_Titel
Autonomic Computing, 2005. ICAC 2005. Proceedings. Second International Conference on
Conference_Location
Seattle, WA
Print_ISBN
0-7695-2276-9
Type
conf
DOI
10.1109/ICAC.2005.25
Filename
1498098