• DocumentCode
    2206187
  • Title
    Efficient Resubmission Strategies to Design Robust Grid Production Environments
  • Author
    Lingrand, Diane ; Montagnat, Johan
  • Author_Institution
    CNRS, Univ. of Nice - Sophia Antipolis, Sophia Antipolis, France
  • fYear
    2010
  • fDate
    7-10 Dec. 2010
  • Firstpage
    198
  • Lastpage
    205
  • Abstract
    Production grids exhibit high failure rates, which hamper the development of many large-scale scientific applications. End users require robust experiment production environments that ensure efficient resubmission of failed tasks. Proper parameterization of resubmission strategies is a complex problem that depends on the non-stationary workload conditions experienced by the infrastructure. To determine optimal resubmission parameters, probabilistic models of the overhead experienced by grid jobs are defined, taking into account the distribution of faults as measured on the infrastructure. Two strategies that can be implemented on the client side are proposed. Their models are evaluated under variable workload conditions to assess their validity over time. Their results are compared, and a trade-off between usability and model accuracy is discussed.
  • Keywords
    fault tolerance; grid computing; production engineering computing; experiment production environment; large scale scientific application; nonstationary workload condition; optimal resubmission parameter; probabilistic model; production grid; resubmission strategy; robust grid production environment; Computational modeling; Delay; Equations; Mathematical model; Monitoring; Probabilistic logic; Production; Fault tolerance; Grid computing; Probabilistic modeling
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    e-Science (e-Science), 2010 IEEE Sixth International Conference on
  • Conference_Location
    Brisbane, QLD
  • Print_ISBN
    978-1-4244-8957-2
  • Electronic_ISBN
    978-0-7695-4290-4
  • Type
    conf
  • DOI
    10.1109/eScience.2010.11
  • Filename
    5693918