• DocumentCode
    2502209
  • Title

    Understanding large system failures - a fault injection experiment

  • Author

    Chillarege, R. ; Bowen, N.S.

  • Author_Institution
    IBM Thomas J. Watson Res. Center, Yorktown Heights, NY, USA
  • fYear
    1989
  • fDate
    21-23 June 1989
  • Firstpage
    356
  • Lastpage
    363
  • Abstract
    Fault injection is used to characterize large system failures. Thus, it overcomes limitations imposed by the lack of complete information in field failure data. The experiment is conducted on a commercial transaction processing system. The authors: (1) introduce the idea of failure acceleration to conduct such experiments; (2) estimate total loss of the primary service to occur in only 16% of the faults; (3) reveal errors termed potential hazards that do not affect short-term availability but cause a catastrophic failure following a change in operating state; and (4) identify at least 41% of errors as potential candidates for repair before total failure. The results enhance the understanding of large system failures and provide a foundation for design enhancements and modeling of availability.
  • Keywords
    fault tolerant computing; software reliability; catastrophic failure; commercial transaction processing system; failure acceleration; fault injection experiment; field failure data; large system failures; modeling of availability; potential hazards; Acceleration; Automatic control; Automatic testing; Automation; Cause effect analysis; Control systems; Failure analysis; Hazards; Laboratories; Software systems
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Fault-Tolerant Computing, 1989. FTCS-19. Digest of Papers., Nineteenth International Symposium on
  • Conference_Location
    Chicago, IL, USA
  • Print_ISBN
    0-8186-1959-7
  • Type

    conf

  • DOI
    10.1109/FTCS.1989.105592
  • Filename
    105592