• DocumentCode
    1639464
  • Title
    Fault Tolerance Management for a Hierarchical GridRPC Middleware
  • Author
    Bouteiller, Aurelien ; Desprez, Frederic
  • Author_Institution
    CNRS, INRIA-UCBL, Lyon
  • fYear
    2008
  • Firstpage
    484
  • Lastpage
    491
  • Abstract
    The GridRPC model is well suited for high performance computing on grids because it efficiently solves most of the issues raised by geographically and administratively split resources. Because of their large scale, long-range networks and heterogeneity, grids are extremely prone to failures. GridRPC middleware usually manages failures by using 1) a failure detector provided by TCP or another network link layer, 2) automatic checkpoints of sequential jobs and 3) a centralized stable agent to perform scheduling. Recent developments have provided new mechanisms: the optimal failure detector of Chandra, Toueg and Aguilera, numerical libraries that now mostly provide their own optimized checkpoint routines, and GridRPC architectures with distributed scheduling. In this paper we adapt to these novelties by providing the first implementation and evaluation in a grid system of the optimal fault detector, a novel and simple checkpoint API that manages both service-provided and automatic checkpoints (even for parallel services), and a recovery algorithm for the scheduling hierarchy that tolerates several simultaneous failures. All these mechanisms are implemented in the DIET middleware and evaluated on a real grid.
  • Keywords
    checkpointing; grid computing; middleware; software fault tolerance; transport protocols; TCP; automatic checkpoints; fault tolerance management; hierarchical GridRPC middleware; optimal fault detector; Detectors; Fault detection; Fault tolerance; Grid computing; High performance computing; Large-scale systems; Libraries; Middleware; Network servers; Processor scheduling; Checkpoint; Distributed algorithm; Failure detector; Fault tolerant; GridRPC;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster Computing and the Grid, 2008. CCGRID '08. 8th IEEE International Symposium on
  • Conference_Location
    Lyon
  • Print_ISBN
    978-0-7695-3156-4
  • Electronic_ISBN
    978-0-7695-3156-4
  • Type
    conf
  • DOI
    10.1109/CCGRID.2008.14
  • Filename
    4534253