Title :
Distributed Troubleshooting Agents
Author :
Earl, Charles ; Remolina, Emilio ; Ong, Jim ; Brown, John
Author_Institution :
Stottler Henke Associates Inc., San Mateo, CA
Abstract :
Key issues to address in autonomic job recovery for cluster computing are recognizing job failure; understanding the failure sufficiently to know if and how to restart the job; and rapidly integrating this information into the cluster architecture so that the failure is better mitigated in the future. The agent-based high availability (ABHA) system provides an API and a collection of services for building autonomic batch job recovery into cluster and grid computing environments. An agent API allows users to define agents for failure diagnosis and recovery. ABHA is currently being evaluated in the U.S. Department of Energy's STAR project.
Keywords :
application program interfaces; fault diagnosis; grid computing; software agents; software fault tolerance; system recovery; workstation clusters; ABHA system; STAR project; US Department of Energy; agent API; agent based high availability system; application program interface; batch job recovery; cluster architecture; cluster computing; distributed troubleshooting agents; failure diagnosis; failure mitigation; failure recovery; grid computing; job failure recognition; job restarting; Application software; Availability; Buildings; Computer architecture; Computer languages; Computer networks; Distributed computing; Grid computing; Java; Job production systems
Conference_Title :
Autonomic Computing, 2005. ICAC 2005. Proceedings. Second International Conference on
Conference_Location :
Seattle, WA
Print_ISBN :
0-7695-2276-9
DOI :
10.1109/ICAC.2005.25