Author :
Sakai, Masato ; Matsuba, Hiroya ; Ishikawa, Yutaka
Abstract :
We propose a fault detection system that an application activates when it recognizes the occurrence of a failure, in order to realize self-managing systems that automatically find the source of a failure. Existing detection systems raise three issues for constructing self-managing applications: i) the detection results are not sent to the applications, ii) they cannot identify the source failure among all of the detected failures, and iii) configuring the detection system for a networked system is hard work. To overcome these issues, the proposed system takes three approaches: i) the system receives failure information from an application and returns a result set to the application, ii) the system identifies the source failure using relationships among errors, and iii) the system obtains information about the monitored system from a database. The relationships among errors are expressed as a tree, called an error relationship tree. The database provides information about system entities such as hardware devices, software objects, and network topology. When the proposed system starts looking for the source of a failure, it refers to the causal relations in the error relationship tree and derives the correspondence between error definitions and actual objects using the database. We show the design of the detection operation activated by the failure information and the architecture of the proposed system.
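The lookup described in the abstract might be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the names `ErrorNode`, `find_sources`, `entities`, and `is_failing`, along with the sample errors and objects, are all hypothetical. A plain dictionary stands in for the database of system entities, and a set of observed errors stands in for the actual detection probes.

```python
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ErrorNode:
    """One error definition in a (hypothetical) error relationship tree."""
    name: str                      # error definition, e.g. "link_down"
    entity_type: str               # kind of system entity the error applies to
    causes: list[ErrorNode] = field(default_factory=list)  # possible causes

def find_sources(node: ErrorNode,
                 entities: dict[str, list[str]],
                 is_failing: Callable[[str, str], bool]) -> list[tuple[str, str]]:
    """Return (error, object) pairs for the deepest detected errors under node."""
    # Map the error definition to actual objects using the entity "database".
    failing = [obj for obj in entities.get(node.entity_type, [])
               if is_failing(node.name, obj)]
    if not failing:
        return []
    # Follow the causal relations: a detected cause supersedes its effect.
    deeper = [src for cause in node.causes
              for src in find_sources(cause, entities, is_failing)]
    return deeper if deeper else [(node.name, obj) for obj in failing]

# Assumed sample data: an application timeout caused by a host or a switch port.
root = ErrorNode("app_timeout", "application", [
    ErrorNode("server_down", "host"),
    ErrorNode("link_down", "switch_port"),
])
entities = {
    "application": ["web-app"],
    "host": ["host-a", "host-b"],
    "switch_port": ["sw1/port3"],
}
observed = {("app_timeout", "web-app"), ("link_down", "sw1/port3")}

# The application reports the failure and receives the result set.
result = find_sources(root, entities, lambda err, obj: (err, obj) in observed)
print(result)
```

In this sketch the application-level error is reported first, and the traversal returns only the errors that have no detected cause beneath them, which corresponds to identifying the source failure among all detected failures.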
Keywords :
fault tolerant computing; object-oriented databases; system recovery; common information model; error relationship tree; fault detection system; hardware devices; network topology; networked system; object-oriented database; self managing systems; software object; Computer errors; Computer integrated manufacturing; Computerized monitoring; Condition monitoring; Databases; Error correction; Fault detection; Hardware; Network topology; Switches;