Title :
On High-Assurance Scientific Workflows
Author :
Vouk, Mladen A. ; Mouallem, Pierre A.
Author_Institution :
Dept. of Comput. Sci., North Carolina State Univ., Raleigh, NC, USA
Abstract :
Scientific Workflow Management Systems (S-WFMS), such as Kepler, have proven to be important tools in scientific problem solving. Interestingly, S-WFMS fault-tolerance and failure recovery remain an open topic. They often involve classic fault-tolerance mechanisms, such as alternative versions and rollback with re-runs, and reliance on the fault-tolerance capabilities provided by subcomponents and lower layers such as schedulers, Grid and cloud resources, or the underlying operating systems. When failures occur at the underlying layers, a workflow system sees them as failed steps in the process, but frequently without additional detail. This limits the ability of an S-WFMS to recover from failures. We describe a lightweight end-to-end S-WFMS fault-tolerance framework, developed to handle failure patterns that occur in some real-life scientific workflows. Capabilities and limitations of the framework are discussed and assessed using simulations. The results show that the solution considerably increases workflow reliability and execution time stability.
Keywords :
software fault tolerance; system recovery; workflow management software; Kepler; S-WFMS fault-tolerance framework; failure patterns; failure recovery; high-assurance scientific workflows; scientific problem solving; scientific workflow management systems; Data models; Fault tolerance; Fault tolerant systems; Middleware; Monitoring; Supercomputers; Kepler; Scientific workflows; end-to-end framework; fault-tolerance;
Conference_Title :
High-Assurance Systems Engineering (HASE), 2011 IEEE 13th International Symposium on
Conference_Location :
Boca Raton, FL
Print_ISBN :
978-1-4673-0107-7
DOI :
10.1109/HASE.2011.58