DocumentCode :
3571979
Title :
Healing online service systems via mining historical issue repositories
Author :
Rui Ding ; Qiang Fu ; Jian-Guang Lou ; Qingwei Lin ; Dongmei Zhang ; Jiajun Shen ; Tao Xie
Author_Institution :
Microsoft Res., Beijing, China
fYear :
2012
Firstpage :
318
Lastpage :
321
Abstract :
Online service systems have become increasingly popular and important, and demand for the availability of the services they provide keeps growing; significant effort goes into keeping these services up continuously. Reducing the MTTR (Mean Time to Restore) of a service therefore remains the most important step in assuring the user-perceived availability of the service. To reduce the MTTR, a common practice is to restore the service by identifying and applying an appropriate healing action (i.e., a temporary workaround such as rebooting a SQL machine). However, manually identifying an appropriate healing action for a given new issue (such as a service outage) is typically time-consuming and error-prone. To address this challenge, we present an automated mining-based approach for suggesting an appropriate healing action for a given new issue. Our approach generates signatures of an issue from its corresponding transaction logs and then retrieves historical issues from a historical issue repository. Finally, our approach suggests an appropriate healing action by adapting the healing actions of the retrieved historical issues. We have implemented a healing suggestion system based on our approach and applied it to a real-world online service product that serves millions of online customers globally. Studies on 77 incidents (severe issues) over 3 months showed that our approach can effectively provide appropriate healing actions to reduce the MTTR of the service.
Keywords :
data mining; fault tolerant computing; information services; MTTR; automated mining-based approach; healing online service systems; healing suggestion system; historical issue repository mining; historical issue retrieval; mean time to restore; service user-perceived availability; transaction logs; Online service system; healing action;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Automated Software Engineering (ASE), 2012 Proceedings of the 27th IEEE/ACM International Conference on
Print_ISBN :
978-1-4503-1204-2
Type :
conf
DOI :
10.1145/2351676.2351735
Filename :
6494945