Title :
Automatic Problem Localization via Multi-dimensional Metric Profiling
Author :
Laguna, Ignacio ; Mitra, Subhasish ; Arshad, Fahad A. ; Theera-Ampornpunt, Nawanol ; Zhu, Zongyang ; Bagchi, Saurabh ; Midkiff, Samuel P. ; Kistler, Mike ; Gheith, Ahmed
fDate :
Sept. 30 2013-Oct. 3 2013
Abstract :
Debugging today's large-scale distributed applications is complex. Traditional debugging techniques such as breakpoint-based debugging and performance profiling require a substantial amount of domain knowledge and do not automate the process of locating bugs and performance anomalies. We present Orion, a framework to automate the problem-localization process in distributed applications. From a large set of metrics, Orion intelligently chooses important metrics and models the application's runtime behavior through pairwise correlations of those metrics in the system, within multiple non-overlapping time windows. When correlations deviate from those of a learned correct model due to a bug, our analysis pinpoints the metrics and code regions (class and method within it) that are most likely associated with the failure. We demonstrate our framework with several real-world failure cases in distributed applications such as HBase, Hadoop DFS, a campus-wide Java application, and a regression testing framework from IBM. Our results show that Orion is able to pinpoint the metrics and code regions that developers need to concentrate on to fix the failures.
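The windowed-correlation idea in the abstract can be illustrated with a minimal sketch. This is not Orion's implementation: the Pearson coefficient, the sum-of-absolute-deviations score, and the names `pearson`, `window_correlations`, and `rank_suspect_metrics` are all assumptions made for illustration. The sketch also omits the metric-selection step and the mapping from metrics to code regions.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length samples (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def window_correlations(metrics, window):
    """Pairwise correlations per non-overlapping window.

    metrics: dict mapping metric name -> list of samples (hypothetical layout).
    Returns one dict {(name_a, name_b): correlation} per full window.
    """
    names = sorted(metrics)
    length = min(len(samples) for samples in metrics.values())
    result = []
    for start in range(0, length - window + 1, window):
        end = start + window
        result.append({
            (a, b): pearson(metrics[a][start:end], metrics[b][start:end])
            for a, b in combinations(names, 2)
        })
    return result


def rank_suspect_metrics(normal, faulty, window):
    """Rank metrics by how far their correlations drift from the normal run.

    Assumes both runs report the same metric names. The scoring rule
    (sum of absolute correlation differences) is an illustrative choice.
    """
    scores = {name: 0.0 for name in normal}
    baseline = window_correlations(normal, window)
    observed = window_correlations(faulty, window)
    for base, obs in zip(baseline, observed):
        for (a, b), corr in base.items():
            drift = abs(corr - obs[(a, b)])
            scores[a] += drift
            scores[b] += drift
    return sorted(scores, key=scores.get, reverse=True)
```

With made-up samples in which one metric reverses its trend in the faulty run, that metric takes part in every drifting pair and is therefore ranked first:

```python
normal = {"cpu": [1, 2, 3, 4, 5, 6, 7, 8],
          "mem": [2, 4, 6, 8, 10, 12, 14, 16],
          "io": [1, 3, 5, 7, 9, 11, 13, 15]}
faulty = dict(normal, io=[15, 13, 11, 9, 7, 5, 3, 1])
suspects = rank_suspect_metrics(normal, faulty, window=4)
```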
Keywords :
distributed processing; program debugging; software performance evaluation; statistical testing; HBase; Hadoop DFS; IBM; ORION; automatic problem localization; breakpoint-based debugging; bug locating process automation; campus-wide Java application; debugging techniques; large-scale distributed applications; multidimensional metric profiling; nonoverlapping time windows; performance anomalies; problem-localization process; regression testing framework; Algorithm design and analysis; Computer bugs; Correlation; Debugging; Hardware; Measurement; Principal component analysis; debugging aids; diagnostics; performance metrics; tracing;
Conference_Titel :
Reliable Distributed Systems (SRDS), 2013 IEEE 32nd International Symposium on
Conference_Location :
Braga
DOI :
10.1109/SRDS.2013.21