DocumentCode :
657560
Title :
Identifying silent failures of SaaS services using finite state machine based invariant analysis
Author :
Goel, Geetika ; Roy, Anirban ; Ganesan, Rajeshwari
Author_Institution :
Next Gen Comput. Lab., Infosys Labs., Bangalore, India
fYear :
2013
fDate :
4-7 Nov. 2013
Firstpage :
290
Lastpage :
295
Abstract :
Field failure analysis is usually driven by a characterization of the time-related properties of failures. This characterization does not help the production support team understand the root cause. To pinpoint the root cause of a failure, one of the most effective techniques is to check for violations of the system invariants, which are the consistent, time-invariant correlations that exist in the system. Understanding when and where these violations happen helps in detecting the root cause of the failure. Silent failures, on the other hand, are characterized by no evidence of failure in either the console or the field failure logs. They are unearthed at moments of crisis, either through a customer complaint or through other cascading failures. These failures often result in data loss or data corruption, creating many latent errors. Accumulation of these errors over time results in degraded system performance. This is the problem of software aging, and restoration of the system, i.e. its rejuvenation, becomes a critical need. Subsequent to the restoration, a rigorous failure detection mechanism is needed to detect such failures early. In this paper we describe a novel method for detecting silent failures using a combination of invariant violation checking and finite state machine (FSM) based analysis of the system. We use the audit-trail logs of the system to extract information about the states and transitions for the FSM representation. So far, our research work has been limited to proving the efficiency of the method. We applied this approach to our SaaS platform and were able to detect 36 silent failures over a period of 9 months. As next steps, we will implement this as part of automated failure detection in operational SaaS platforms.
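The idea of replaying audit-trail records against an FSM can be illustrated with a minimal sketch. This is not the authors' implementation: the state names, the `ALLOWED` transition table, the `(entity_id, state)` record layout, and the function `find_violations` are all hypothetical, chosen only to show how a transition that the FSM does not permit could be flagged as a candidate silent failure.

```python
# Hypothetical FSM for a tracked entity. The states and allowed transitions
# are illustrative assumptions, not taken from the paper.
ALLOWED = {
    "CREATED": {"VALIDATED"},
    "VALIDATED": {"PROCESSED"},
    "PROCESSED": {"COMPLETED"},
    "COMPLETED": set(),
}
INITIAL = "CREATED"


def find_violations(audit_log):
    """Replay (entity_id, state) records in order and collect every
    transition that the FSM above does not allow."""
    last_state = {}   # most recent state seen for each entity
    violations = []   # (entity_id, previous_state, new_state) triples
    for entity_id, state in audit_log:
        prev = last_state.get(entity_id)
        if prev is None:
            # First record for this entity must be the initial state.
            if state != INITIAL:
                violations.append((entity_id, None, state))
        elif state not in ALLOWED.get(prev, set()):
            violations.append((entity_id, prev, state))
        last_state[entity_id] = state
    return violations


# Assumed sample audit trail: entity "b2" skips the VALIDATED state.
log = [
    ("a1", "CREATED"), ("a1", "VALIDATED"),
    ("a1", "PROCESSED"), ("a1", "COMPLETED"),
    ("b2", "CREATED"), ("b2", "PROCESSED"),
]
print(find_violations(log))
```

In this sketch the invariant is simply "every observed transition is in the allowed set"; a real deployment would presumably derive the states, transitions, and invariants from the platform's own audit-trail schema.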
Keywords :
cloud computing; finite state machines; software fault tolerance; FSM representation; SaaS services; audit-trail logs; automated failure detection; data corruption; data loss; field failure analysis; finite state machine based invariant analysis; information extraction; operational SaaS platforms; software aging; software restoration; Automata; Computer crashes; Correlation; Data mining; Failure analysis; Production; Software as a service; console logs; false positives; finite state machine; invariant violations; silent failures; state properties;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Software Reliability Engineering Workshops (ISSREW), 2013 IEEE International Symposium on
Conference_Location :
Pasadena, CA
Type :
conf
DOI :
10.1109/ISSREW.2013.6688909
Filename :
6688909