DocumentCode :
330834
Title :
Reliability analysis of clustered computing systems
Author :
Mendiratta, Veena B.
Author_Institution :
AT&T Bell Labs., Naperville, IL, USA
fYear :
1998
fDate :
4-7 Nov 1998
Firstpage :
268
Lastpage :
272
Abstract :
Clustered computing systems, using commercially available computers networked in a loosely coupled fashion, can provide high levels of reliability if appropriate levels of error detection and recovery software are implemented in the middleware and application layers. In this paper, we present a modeling approach for analyzing the hardware and software reliability of clustered computing systems. The clustered system is modeled as an irreducible Markov chain with working and failed states, and intermediate recovery states. The failure and recovery behavior is characterized in terms of the frequency and duration of fault recoveries and outages for a single processor in the cluster and for the entire clustered system. We apply the model to a telecommunication switching system application that uses the Lucent Technologies Reliable Clustered Computing product. The model results are presented for a range of values of the processor failure rate and the fault recovery coverage factor.
Keywords :
Markov processes; client-server systems; computer network reliability; electronic switching systems; error detection; software reliability; switching networks; system recovery; telecommunication computing; workstation clusters; Lucent Technologies Reliable Clustered Computing product; application layers; clustered computing systems reliability; commercially available computers; error detection; error recovery software; failed states; failure behavior; fault recovery behavior; fault recovery coverage factor; hardware reliability; intermediate recovery states; irreducible Markov chain; loosely-coupled computer network; middleware; modeling; networked computers; outages; processor failure rate; software reliability; telecommunication switching system; working states; Application software; Computer errors; Computer network reliability; Computer networks; Frequency; Hardware; Middleware; Software reliability; Telecommunication computing; Telecommunication switching;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Software Reliability Engineering, 1998. Proceedings. The Ninth International Symposium on
Conference_Location :
Paderborn
ISSN :
1071-9458
Print_ISBN :
0-8186-8991-9
Type :
conf
DOI :
10.1109/ISSRE.1998.730890
Filename :
730890