Handling Failures in Parallel Scientific Workflows Using Clouds

Author

Costa, Francois ; de Oliveira, Daniel ; Ocala, Kary ; Ogasawara, Eduardo ; Dias, Joana ; Mattoso, Marta

Author_Institution

COPPE, Fed. Univ. of Rio de Janeiro, Rio de Janeiro, Brazil

fYear

2012

fDate

10-16 Nov. 2012

Firstpage

129

Lastpage

139

Abstract

Failures are common in High Performance Computing (HPC) environments and can significantly impact the performance of scientific workflows executing on top of these large-scale computing environments. Computing clouds are being used as promising HPC environments. Although clouds offer several advantages such as elasticity and availability, failures are very frequent in this type of environment, where virtualization, instabilities and providers' actions directly impact workflow execution. As a result, activity failures are almost inevitable in clouds, where virtual machine failures are a reality rather than a possibility. In this paper we present a set of failure handling heuristics based on cloud characteristics, which are implemented within SciMultaneous, a service-oriented architecture that manages re-executions of failed scientific workflow activities using runtime provenance data. Experimental results on clouds showed that SciMultaneous and its heuristics considerably increase workflow completion and reduce the total execution time (TET) of the workflow (even considering executions or re-executions) by up to 45% when compared to a posteriori re-execution approaches. We analyze SciMultaneous' behavior under a series of activity failure types and conclude that even a single activity failure can have a large detrimental effect on scientific workflow TET.

Keywords

cloud computing; operating systems (computers); parallel processing; service-oriented architecture; system recovery; virtual machines; HPC environments; SciMultaneous; TET; cloud characteristics; handling failures; high performance computing; large scale computing environments; parallel scientific workflows; total execution time; virtual machine failures; workflow execution; Failure handling; Scientific Workflows

fLanguage

English

Publisher

IEEE

Conference_Titel

2012 SC Companion: High Performance Computing, Networking, Storage and Analysis (SCC)

Conference_Location

Salt Lake City, UT

Print_ISBN

978-1-4673-6218-4

Type

conf

DOI

10.1109/SC.Companion.2012.28

Filename

6495810