DocumentCode :
1639464
Title :
Fault Tolerance Management for a Hierarchical GridRPC Middleware
Author :
Bouteiller, Aurelien ; Desprez, Frederic
Author_Institution :
CNRS, INRIA-UCBL, Lyon
fYear :
2008
Firstpage :
484
Lastpage :
491
Abstract :
The GridRPC model is well suited to high-performance computing on grids because it efficiently addresses most of the issues raised by geographically and administratively split resources. Because of their large scale, long-range networks, and heterogeneity, grids are extremely prone to failures. GridRPC middleware usually manages failures by using 1) a failure detector provided by TCP or another network link layer, 2) automatic checkpoints of sequential jobs, and 3) a centralized stable agent to perform scheduling. Recent developments have provided new mechanisms: the optimal failure detector of Chandra, Toueg, and Aguilera; numerical libraries, most of which now provide their own optimized checkpoint routines; and distributed-scheduling GridRPC architectures. In this paper we aim to adapt to these novelties by providing the first implementation and evaluation in a grid system of the optimal fault detector, a novel and simple checkpoint API that manages both service-provided and automatic checkpoints (even for parallel services), and a scheduling hierarchy recovery algorithm that tolerates several simultaneous failures. All of these mechanisms are implemented in the DIET middleware and evaluated on a real grid.
Keywords :
checkpointing; grid computing; middleware; software fault tolerance; transport protocols; TCP; automatic checkpoints; fault tolerance management; hierarchical GridRPC middleware; optimal fault detector; Detectors; Fault detection; Fault tolerance; Grid computing; High performance computing; Large-scale systems; Libraries; Middleware; Network servers; Processor scheduling; Checkpoint; Distributed algorithm; Failure detector; Fault tolerant; GridRPC;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Cluster Computing and the Grid, 2008. CCGRID '08. 8th IEEE International Symposium on
Conference_Location :
Lyon
Print_ISBN :
978-0-7695-3156-4
Electronic_ISBN :
978-0-7695-3156-4
Type :
conf
DOI :
10.1109/CCGRID.2008.14
Filename :
4534253