• DocumentCode
    1913338
  • Title

    Handling Failures in Parallel Scientific Workflows Using Clouds

  • Author

    Costa, Francois ; de Oliveira, Daniel ; Ocala, Kary ; Ogasawara, Eduardo ; Dias, Joana ; Mattoso, Marta

  • Author_Institution
    COPPE, Fed. Univ. of Rio de Janeiro, Rio de Janeiro, Brazil
  • fYear
    2012
  • fDate
    10-16 Nov. 2012
  • Firstpage
    129
  • Lastpage
    139
  • Abstract
    Failures are common in High Performance Computing (HPC) environments and can significantly impact the performance of scientific workflows executing on top of these large-scale computing environments. Computing clouds are being used as promising HPC environments. Although clouds offer several advantages such as elasticity and availability, failures are very frequent in this type of environment, where virtualization, instabilities and providers' actions directly impact workflow execution. As a result, activity failures are almost inevitable in clouds, where virtual machine failures are a reality rather than a possibility. In this paper we present a set of failure handling heuristics based on cloud characteristics, which are implemented within SciMultaneous, a service-oriented architecture that manages re-executions of failed scientific workflow activities using runtime provenance data. Experimental results on clouds showed that SciMultaneous and its heuristics considerably increase workflow completion and reduce the total execution time (TET) of the workflow (even considering executions or re-executions) by up to 45% when compared to a posteriori re-execution approaches. We analyzed SciMultaneous' behavior under a series of activity failure types and concluded that even a single activity failure could have a large detrimental effect on scientific workflow TET.
  • Keywords
    cloud computing; operating systems (computers); parallel processing; service-oriented architecture; system recovery; virtual machines; HPC environments; SciMultaneous; TET; cloud characteristics; handling failures; high performance computing; large scale computing environments; parallel scientific workflows; total execution time; virtual machine failures; workflow execution; Failure handling; Scientific Workflows
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion
  • Conference_Location
    Salt Lake City, UT
  • Print_ISBN
    978-1-4673-6218-4
  • Type

    conf

  • DOI
    10.1109/SC.Companion.2012.28
  • Filename
    6495810