DocumentCode :
2320920
Title :
Self-Healing of Operational Workflow Incidents on Distributed Computing Infrastructures
Author :
Silva, Rafael Ferreira da ; Glatard, Tristan ; Desprez, Frédéric
Author_Institution :
INSERM, Univ. of Lyon, Villeurbanne, France
fYear :
2012
fDate :
13-16 May 2012
Firstpage :
318
Lastpage :
325
Abstract :
Distributed computing infrastructures are commonly used through scientific gateways, but operating these gateways requires substantial human intervention to handle operational incidents. This paper presents a self-healing process that quantifies incident degrees of workflow activities from metrics measuring long-tail effect, application efficiency, data transfer issues, and site-specific problems. These metrics are simple enough to be computed online, and they make few assumptions about the application or resource characteristics. Incidents are classified into levels and associated with sets of healing actions that are selected based on association rules modeling correlations between incident levels. The healing process is parametrized on real application traces acquired in production on the European Grid Infrastructure. Implementation and experimental results obtained in the Virtual Imaging Platform show that the proposed method speeds up execution by up to a factor of 4 and properly detects unrecoverable errors.
Keywords :
data mining; electronic data interchange; grid computing; software fault tolerance; European Grid Infrastructure; application efficiency; association rules; data transfer issues; distributed computing infrastructures; incidents classification; long-tail effect; operational incidents handling; operational workflow incident self-healing; resource characteristics; scientific gateways; site-specific problems; virtual imaging platform; Association rules; Computational modeling; Logic gates; Measurement; Monitoring; Wheels; Error detection and handling; Production distributed systems; Workflow execution;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Cluster, Cloud and Grid Computing (CCGrid), 2012 12th IEEE/ACM International Symposium on
Conference_Location :
Ottawa, ON
Print_ISBN :
978-1-4673-1395-7
Type :
conf
DOI :
10.1109/CCGrid.2012.24
Filename :
6217437