DocumentCode
2502209
Title
Understanding large system failures-a fault injection experiment
Author
Chillarege, R. ; Bowen, N.S.
Author_Institution
IBM Thomas J. Watson Res. Center, Yorktown Heights, NY, USA
fYear
1989
fDate
21-23 June 1989
Firstpage
356
Lastpage
363
Abstract
Fault injection is used to characterize large system failures. Thus, it overcomes limitations imposed by the lack of complete information in field failure data. The experiment is conducted on a commercial transaction processing system. The authors: (1) introduce the idea of failure acceleration to conduct such experiments; (2) estimate total loss of the primary service to occur in only 16% of the faults; (3) reveal errors termed potential hazards that do not affect short-term availability but cause a catastrophic failure following a change in operating state; and (4) identify at least 41% of errors as potential candidates for repair before total failure. The results enhance the understanding of large system failures and provide a foundation for design enhancements and modeling of availability.
Keywords
fault tolerant computing; software reliability; catastrophic failure; commercial transaction processing system; failure acceleration; fault injection experiment; field failure data; large system failures; modeling of availability; potential hazards; Acceleration; Automatic control; Automatic testing; Automation; Cause effect analysis; Control systems; Failure analysis; Hazards; Laboratories; Software systems
fLanguage
English
Publisher
IEEE
Conference_Titel
Fault-Tolerant Computing, 1989. FTCS-19. Digest of Papers., Nineteenth International Symposium on
Conference_Location
Chicago, IL, USA
Print_ISBN
0-8186-1959-7
Type
conf
DOI
10.1109/FTCS.1989.105592
Filename
105592