• DocumentCode
    1835976
  • Title
    Detection and Diagnosis of Recurrent Faults in Software Systems by Invariant Analysis
  • Author
    Jiang, Miao ; Munawar, Mohammad A. ; Reidemeister, Thomas ; Ward, Paul A S
  • Author_Institution
    Dept. of Electr. & Comput. Eng., Univ. of Waterloo, Waterloo, ON
  • fYear
    2008
  • fDate
    3-5 Dec. 2008
  • Firstpage
    323
  • Lastpage
    332
  • Abstract
    A correctly functioning enterprise-software system exhibits long-term, stable correlations between many of its monitoring metrics. Some of these correlations no longer hold when there is an error in the system, potentially enabling error detection and fault diagnosis. However, existing approaches are inefficient, requiring a large number of metrics to be monitored and ignoring the relative discriminative properties of different metric correlations. In enterprise-software systems, similar faults tend to recur. It is therefore possible to significantly improve existing correlation-analysis approaches by learning the effects of common recurrent faults on correlations. We present methods to determine the most significant correlations to track for efficient error detection, and the correlations that contribute the most to diagnosis accuracy. We apply machine learning to identify the relevant correlations, removing the need for the manually configured correlation thresholds used in prior approaches. We validate our work on a multi-tier enterprise-software system. We are able to detect and correctly diagnose 8 of 10 injected faults to within three possible causes, and to within two in 7 of those 8 cases. This compares favourably with existing approaches, whose diagnosis accuracy is 3 of 10 to within three possible causes. We achieve a precision of at least 95%.
  • Keywords
    business data processing; learning (artificial intelligence); software fault tolerance; software metrics; enterprise-software system; error detection; invariant analysis; machine learning; monitoring metrics; recurrent fault detection; recurrent fault diagnosis; Availability; Computer errors; Computerized monitoring; Condition monitoring; Electrical fault detection; Fault detection; Fault diagnosis; Neural networks; Software systems; Systems engineering and theory; error detection; fault diagnosis; metric correlations; neural network; system invariants
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Assurance Systems Engineering Symposium, 2008. HASE 2008. 11th IEEE
  • Conference_Location
    Nanjing
  • ISSN
    1530-2059
  • Print_ISBN
    978-0-7695-3482-4
  • Type
    conf
  • DOI
    10.1109/HASE.2008.16
  • Filename
    4708890