Title :
Automatic fault characterization via abnormality-enhanced classification
Author :
Bronevetsky, Greg ; Laguna, Ignacio ; De Supinski, Bronis R. ; Bagchi, Saurabh
Author_Institution :
Lawrence Livermore Nat. Lab., Livermore, CA, USA
Abstract :
Enterprise and high-performance computing systems are growing extremely large and complex, employing many processors and diverse software/hardware stacks. As these machines grow in scale, faults become more frequent, and system complexity makes them difficult to detect and diagnose. The difficulty is particularly acute for faults that degrade system performance or cause erratic behavior but do not cause outright crashes. The cost of these errors is high since they significantly reduce system productivity, both initially and through the time required to resolve them. Current system management techniques do not work well since they require manual examination of system behavior and do not identify root causes. When a fault manifests, system administrators need timely notification about the type of fault, the time period in which it occurred, and the processor on which it originated. Statistical modeling approaches can accurately characterize normal and abnormal system behavior. However, the complex effects of system faults are less amenable to these techniques. This paper demonstrates that the complexity of system faults makes traditional classification and clustering algorithms inadequate for characterizing them. We design novel techniques that combine classification algorithms with information on the abnormality of application behavior to improve detection and characterization accuracy significantly. Our experiments demonstrate that our techniques can detect and characterize faults with 85% accuracy, compared to just 12% accuracy for direct applications of traditional techniques.
Keywords :
fault diagnosis; fault tolerant computing; pattern classification; pattern clustering; statistical analysis; abnormal system behavior characterization; abnormality-enhanced classification; automatic fault characterization; classification algorithms; clustering algorithms; fault detection; statistical modeling approach; system fault complexity; Accuracy; Biological system modeling; Computational modeling; Hardware; Radiation detectors; Software; Training; autonomic management; root cause analysis; statistical modeling
Conference_Title :
Dependable Systems and Networks (DSN), 2012 42nd Annual IEEE/IFIP International Conference on
Conference_Location :
Boston, MA
Print_ISBN :
978-1-4673-1624-8
ISSN :
1530-0889
DOI :
10.1109/DSN.2012.6263926