• DocumentCode
    1816896
  • Title
    Distributed Troubleshooting Agents
  • Author
    Earl, Charles ; Remolina, Emilio ; Ong, Jim ; Brown, John

  • Author_Institution
    Stottler Henke Associates Inc., San Mateo, CA
  • fYear
    2005
  • fDate
    13-16 June 2005
  • Firstpage
    365
  • Lastpage
    366
  • Abstract
    Key issues to address in autonomic job recovery for cluster computing are recognizing job failure; understanding the failure sufficiently to know if and how to restart the job; and rapidly integrating this information into the cluster architecture so that the failure is better mitigated in the future. The agent-based high availability (ABHA) system provides an API and a collection of services for building autonomic batch job recovery into cluster and grid computing environments. An agent API allows users to define agents for failure diagnosis and recovery. ABHA is currently being evaluated in the U.S. Department of Energy's STAR project.
  • Keywords
    application program interfaces; fault diagnosis; grid computing; software agents; software fault tolerance; system recovery; workstation clusters; ABHA system; STAR project; US Department of Energy; agent API; agent based high availability system; application program interface; batch job recovery; cluster architecture; cluster computing; distributed troubleshooting agents; failure diagnosis; failure mitigation; failure recovery; grid computing; job failure recognition; job restarting; Application software; Availability; Buildings; Computer architecture; Computer languages; Computer networks; Distributed computing; Grid computing; Java; Job production systems;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Autonomic Computing, 2005. ICAC 2005. Proceedings. Second International Conference on
  • Conference_Location
    Seattle, WA
  • Print_ISBN
    0-7695-2276-9
  • Type
    conf
  • DOI
    10.1109/ICAC.2005.25
  • Filename
    1498098