Title :
Fault Tolerance in Cluster Federations with O2P-CF
Author :
Ropars, Thomas ; Morin, Christine
Author_Institution :
IRISA/Paris Project-Team, Paris
Abstract :
Fault tolerance is one of the key issues for large-scale applications executed on high-performance computing systems. In a cluster federation, clusters are gathered to provide huge computing power. To work efficiently on such systems, network characteristics have to be taken into account: the latency between two nodes of different clusters is much higher than the latency between two nodes of the same cluster. In this paper, we present O2P-CF, a message logging protocol well suited to providing fault tolerance for message passing applications executed on cluster federations. O2P-CF is based on the combination of O2P, an extremely optimistic message logging protocol, with a pessimistic message logging protocol.
Keywords :
fault tolerant computing; message passing; parallel processing; protocols; workstation clusters; O2P-CF protocol; cluster federations; fault tolerance; high performance computing systems; message passing applications; optimistic message logging protocol; pessimistic message logging protocol; Algorithm design and analysis; Delay; Fault tolerance; Fault tolerant systems; Grid computing; High performance computing; Large-scale systems; Libraries; Message passing; Protocols; Cluster federation; fault tolerance; message logging; message passing application;
Conference_Title :
Cluster Computing and the Grid, 2008. CCGRID '08. 8th IEEE International Symposium on
Conference_Location :
Lyon
Print_ISBN :
978-0-7695-3156-4
Electronic_ISBN :
978-0-7695-3156-4
DOI :
10.1109/CCGRID.2008.76