DocumentCode :
732107
Title :
Crying Wolf and Meaning It: Reducing False Alarms in Monitoring of Sporadic Operations through POD-Monitor
Author :
Xiwei Xu ; Liming Zhu ; Min Fu ; Daniel Sun ; An Binh Tran ; Paul Rimba ; Srini Dwarakanathan ; Len Bass
Author_Institution :
SSRG, NICTA, Sydney, NSW, Australia
fYear :
2015
fDate :
23 May 2015
Firstpage :
69
Lastpage :
75
Abstract :
When monitoring complex applications in cloud systems, a difficult problem for operators is receiving false positive alarms. This becomes worse when the system is sporadically changed and upgraded under the emerging practice of continuous deployment. Other legitimate but sporadic maintenance operations, such as log compression, garbage collection, and data reconstruction in distributed systems, can also trigger false alarms. Consequently, traditional baseline-based anomaly detection and monitoring are less effective. A common but dangerous practice is to turn off normal monitoring during sporadic operations such as upgrades and maintenance. In this paper, we report on the use of the process context information of sporadic operations to suppress false positive alarms. We use the context information both directly and in machine learning. Our experimental evaluation shows that 1) using process context directly improves alarm precision up to 0.226 (36.1% improvement), and 2) using machine learning models trained on process context improves precision up to 0.421 (84.7% improvement).
Keywords :
cloud computing; distributed processing; learning (artificial intelligence); storage management; system monitoring; baseline-based anomaly detection and monitoring; cloud system; continuous deployment practice; crying wolf; data reconstruction; distributed system; false alarm; garbage collection; log compression; monitoring complex application; normal monitoring; pod-monitor; process-context trained machine learning model; sporadic maintenance operation; sporadic operation; Context; Context modeling; Maintenance engineering; Measurement; Monitoring; Predictive models; Training; Alarm; Monitoring; Operation;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Complex Faults and Failures in Large Software Systems (COUFLESS), 2015 IEEE/ACM 1st International Workshop on
Conference_Location :
Florence
Type :
conf
DOI :
10.1109/COUFLESS.2015.18
Filename :
7181485