Regional Information Center for Science and Technology

DocumentCode :

1816896

Title :

Distributed Troubleshooting Agents

Author :

Earl, Charles ; Remolina, Emilio ; Ong, Jim ; Brown, John

Author_Institution :

Stottler Henke Associates Inc., San Mateo, CA

fYear :

2005

fDate :

13-16 June 2005

Firstpage :

365

Lastpage :

366

Abstract :

Key issues to address in autonomic job recovery for cluster computing are recognizing job failure; understanding the failure sufficiently to know if and how to restart the job; and rapidly integrating this information into the cluster architecture so that the failure is better mitigated in the future. The agent-based high availability (ABHA) system provides an API and a collection of services for building autonomic batch job recovery into cluster and grid computing environments. An agent API allows users to define agents for failure diagnosis and recovery. ABHA is currently being evaluated in the U.S. Department of Energy's STAR project.

Keywords :

application program interfaces; fault diagnosis; grid computing; software agents; software fault tolerance; system recovery; workstation clusters; ABHA system; STAR project; US Department of Energy; agent API; agent based high availability system; application program interface; batch job recovery; cluster architecture; cluster computing; distributed troubleshooting agents; failure diagnosis; failure mitigation; failure recovery; grid computing; job failure recognition; job restarting; Application software; Availability; Buildings; Computer architecture; Computer languages; Computer networks; Distributed computing; Grid computing; Java; Job production systems

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Autonomic Computing, 2005. ICAC 2005. Proceedings. Second International Conference on

Conference_Location :

Seattle, WA

Print_ISBN :

0-7695-2276-9

Type :

conf

DOI :

10.1109/ICAC.2005.25

Filename :

1498098

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1816896