Title :
Software fault tolerance in a clustered architecture: techniques and reliability modeling
Author :
Lyu, Michael R. ; Mendiratta, Veena B.
Author_Institution :
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong
Abstract :
System architectures based on a cluster of computers have gained substantial attention recently. In a clustered system, complex software-intensive applications can be built with commercial hardware, operating systems, and application software to achieve high system availability and data integrity, while performance and cost penalties are greatly reduced by the use of separate error detection hardware and dedicated software fault tolerance routines. Within such a system, a watchdog provides mechanisms for error detection and switch-over to a spare or backup processor in the presence of processor failures. The application software is responsible for the extent of the error detection, subsequent recovery actions, and data backup. The application can be made as reliable as the user requires, being constrained only by the upper bounds on reliability imposed by the clustered architecture under various implementation schemes. We present reliability modeling and analysis of the clustered system by defining the hardware, operating system, and application software reliability techniques that need to be implemented to achieve different levels of reliability and comparable degrees of data consistency. We describe these reliability levels in terms of fault detection, fault recovery, volatile data consistency, and persistent data consistency, and develop a Markov reliability model to capture these fault detection and recovery activities. We also demonstrate how this cost-effective fault-tolerant technique can provide quantitative reliability improvement within applications using clustered architectures.
Keywords :
Markov processes; data integrity; error detection; modelling; redundancy; software fault tolerance; system recovery; Markov reliability model; cluster of computers; clustered architecture; complex software-intensive applications; fault detection; fault recovery; high system availability; operating system; persistent data consistency; quantitative reliability improvement; reliability modeling; volatile data consistency; watchdog; Application software; Availability; Computer architecture; Computer errors; Costs; Fault tolerance; Hardware; Operating systems; Software performance
Conference_Title :
Aerospace Conference, 1999. Proceedings. 1999 IEEE
Conference_Location :
Snowmass at Aspen, CO
Print_ISBN :
0-7803-5425-7
DOI :
10.1109/AERO.1999.790197