• DocumentCode
    2320920
  • Title
    Self-Healing of Operational Workflow Incidents on Distributed Computing Infrastructures
  • Author
    Silva, Rafael Ferreira da ; Glatard, Tristan ; Desprez, Frédéric
  • Author_Institution
    INSERM, Univ. of Lyon, Villeurbanne, France
  • fYear
    2012
  • fDate
    13-16 May 2012
  • Firstpage
    318
  • Lastpage
    325
  • Abstract
    Distributed computing infrastructures are commonly used through scientific gateways, but operating these gateways requires significant human intervention to handle operational incidents. This paper presents a self-healing process that quantifies incident degrees of workflow activities from metrics measuring long-tail effect, application efficiency, data transfer issues, and site-specific problems. These metrics are simple enough to be computed online, and they make few assumptions about the application or resource characteristics. Incidents are classified into levels and associated with sets of healing actions, which are selected based on association rules modeling correlations between incident levels. The healing process is parametrized on real application traces acquired in production on the European Grid Infrastructure. Implementation and experimental results obtained in the Virtual Imaging Platform show that the proposed method speeds up execution by up to a factor of 4 and properly detects unrecoverable errors.
  • Keywords
    data mining; electronic data interchange; grid computing; software fault tolerance; European Grid Infrastructure; application efficiency; association rules; data transfer issues; distributed computing infrastructures; incidents classification; long-tail effect; operational incidents handling; operational workflow incident self-healing; resource characteristics; scientific gateways; site-specific problems; virtual imaging platform; Association rules; Computational modeling; Logic gates; Measurement; Monitoring; Wheels; Error detection and handling; Production distributed systems; Workflow execution;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster, Cloud and Grid Computing (CCGrid), 2012 12th IEEE/ACM International Symposium on
  • Conference_Location
    Ottawa, ON
  • Print_ISBN
    978-1-4673-1395-7
  • Type
    conf
  • DOI
    10.1109/CCGrid.2012.24
  • Filename
    6217437