DocumentCode
1913338
Title
Handling Failures in Parallel Scientific Workflows Using Clouds
Author
Costa, Flavio ; de Oliveira, Daniel ; Ocaña, Kary ; Ogasawara, Eduardo ; Dias, Jonas ; Mattoso, Marta
Author_Institution
COPPE, Fed. Univ. of Rio de Janeiro, Rio de Janeiro, Brazil
fYear
2012
fDate
10-16 Nov. 2012
Firstpage
129
Lastpage
139
Abstract
Failures are common in High Performance Computing (HPC) environments and can significantly impact the performance of scientific workflows executing on top of these large-scale computing environments. Computing clouds are being used as promising HPC environments. Although clouds offer several advantages such as elasticity and availability, failures are very frequent in this type of environment, where virtualization, instabilities and providers' actions directly affect workflow execution. As a result, activity failures are almost inevitable in clouds, where virtual machine failures are a reality rather than a possibility. In this paper we present a set of failure handling heuristics based on cloud characteristics, which are implemented within SciMultaneous, a service-oriented architecture that manages re-executions of failed scientific workflow activities using runtime provenance data. Experimental results on clouds showed that SciMultaneous and its heuristics considerably increase workflow completion and reduce the total execution time (TET) of the workflow (even considering executions or re-executions) by up to 45% when compared to a posteriori re-execution approaches. We analyzed SciMultaneous' behavior under a series of activity failure types and concluded that even a single activity failure could have a large detrimental effect on scientific workflow TET.
Keywords
cloud computing; operating systems (computers); parallel processing; service-oriented architecture; system recovery; virtual machines; HPC environments; SciMultaneous; TET; cloud characteristics; handling failures; high performance computing; large scale computing environments; parallel scientific workflows; total execution time; virtual machine failures; workflow execution; Failure handling; Scientific Workflows
fLanguage
English
Publisher
IEEE
Conference_Titel
High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion
Conference_Location
Salt Lake City, UT
Print_ISBN
978-1-4673-6218-4
Type
conf
DOI
10.1109/SC.Companion.2012.28
Filename
6495810