Title :
Fault Tolerance and Recovery in Grid Workflow Management Systems
Author :
Sindrilaru, Elvin ; Costan, Alexandru ; Cristea, Valentin
Author_Institution :
Imperial Coll. London, London, UK
Abstract :
Complex scientific workflows are now commonly executed on global grids. With the increasing scale, complexity, heterogeneity and dynamism of grid environments, the challenges of managing and scheduling these workflows are compounded by dependability issues arising from the inherently unreliable nature of large-scale grid infrastructure. In addition to traditional fault tolerance techniques, specific checkpoint-recovery schemes are needed in current grid workflow management systems to address these reliability challenges. Our research aims to design and develop mechanisms for building an autonomic workflow management system that can detect, diagnose, notify, react to and recover automatically from failures of workflow execution. In this paper we present the development of a Fault Tolerance and Recovery component that extends the ActiveBPEL workflow engine. The detection mechanism inspects the messages exchanged between the workflow and the orchestrated Web Services in search of faults. Recovery of a process from a faulted state is achieved by modifying the default behavior of ActiveBPEL, which in effect provides a non-intrusive checkpointing mechanism. We present the results of several scenarios that demonstrate the functionality of the Fault Tolerance and Recovery component, showing a performance improvement of about 50% over the traditional method of resubmitting the workflow.
Keywords :
checkpointing; fault tolerant computing; grid computing; scheduling; semantic Web; workflow management software; ActiveBPEL workflow engine; Web services; autonomic workflow management system; checkpoint-recovery schemes; fault tolerance; grid workflow management systems; nonintrusive checkpointing mechanism; workflow scheduling; Buildings; Engines; Environmental management; Fault detection; Fault tolerant systems; Large-scale systems; BPEL; dependable systems; workflow management systems
Conference_Title :
Complex, Intelligent and Software Intensive Systems (CISIS), 2010 International Conference on
Conference_Location :
Krakow
Print_ISBN :
978-1-4244-5917-9
DOI :
10.1109/CISIS.2010.113