Title :
Failure analysis of a fault-tolerant 2-node server system
Author :
Jacob, Daniel ; Simon, Eric J. ; Zhang, Wei ; Rose, Dan
Author_Institution :
Relex Software Corp., Greensburg, PA
Abstract :
In this paper, we present an integrated model of the hardware and software failures of a fault-tolerant 2-node server system used in a real-life archive system application. Each node runs a distinct component of the server application software and an identical copy of a fault monitoring service. The fault monitoring service on each node monitors the status of its local application software as well as the availability of the hardware and software on the other node. Upon a node failure, the fault monitoring service on the good node transfers the application software of the failed node to the good node. Upon the failure of an application software component or a fault monitoring service, an automatic restoration is performed by the available fault monitoring service. Failed nodes are restored on a first-come, first-served basis by a single repair facility. The failure and restoration processes of the hardware and software depend strongly on the status of the other components and on the sequence of failure events. Therefore, we employ a decomposition method that uses both combinatorial analysis and Markov-based state space analysis to solve the problem. The proposed method allows the analysis to be extended easily to cases with multiple nodes, multiple software components, and different repair policies.
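As a rough illustration of the Markov-based state space part of such an analysis, the sketch below solves a hardware-only, three-state birth-death model for two identical nodes served by a single repair facility. It is not the authors' model: it ignores the application software, the fault monitoring service, and failover behavior, and it assumes that the system is available whenever at least one node is up. The names `failure_rate`, `repair_rate`, `two_node_steady_state`, and `availability` are hypothetical and chosen only for this example.

```python
def two_node_steady_state(failure_rate: float, repair_rate: float) -> list[float]:
    """Return steady-state probabilities of having 0, 1, or 2 failed nodes.

    Assumptions (illustrative only): both nodes fail independently at the
    same constant rate, and a single repair facility restores one node at
    a time at a constant rate.
    """
    if failure_rate <= 0 or repair_rate <= 0:
        raise ValueError("rates must be positive")
    rho = failure_rate / repair_rate
    # Balance equations of the birth-death chain:
    #   state 0 -> 1 at rate 2 * failure_rate (either node can fail)
    #   state 1 -> 2 at rate failure_rate     (the remaining node fails)
    #   state 1 -> 0 and 2 -> 1 at rate repair_rate (single repair facility)
    weights = [1.0, 2.0 * rho, 2.0 * rho * rho]
    total = sum(weights)
    return [w / total for w in weights]


def availability(failure_rate: float, repair_rate: float) -> float:
    """Probability that at least one node is up (assumed availability criterion)."""
    return 1.0 - two_node_steady_state(failure_rate, repair_rate)[2]
```

With a single repair facility the repair rate stays the same in every state, which is why the second weight carries a factor of 2 from the failure side only. A different repair policy, such as one repair crew per node, would change the transition rate out of the two-failed state and therefore the weights.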
Keywords :
Markov processes; failure analysis; fault tolerant computing; maintenance engineering; monitoring; queueing theory; records management; reliability theory; software reliability; storage area networks; storage management; Markov-based state space analysis; archive system; combinatorial analysis; fault monitoring service; fault-tolerant 2-node server system; hardware failure analysis; server application software; single repair facility; software components; software failures; Aerospace industry; Application software; Blades; Condition monitoring; Failure analysis; Fault tolerant systems; Hardware; Performance analysis; Software performance; Storage automation;
Conference_Title :
Reliability and Maintainability Symposium, 2006. RAMS '06. Annual
Conference_Location :
Newport Beach, CA
Print_ISBN :
1-4244-0007-4
Electronic_ISBN :
0149-144X
DOI :
10.1109/RAMS.2006.1677427