• DocumentCode
    1640860
  • Title

    Fault Tolerance in Cluster Federations with O2P-CF

  • Author

    Ropars, Thomas ; Morin, Christine

  • Author_Institution
    IRISA/Paris Project-Team, Paris
  • fYear
    2008
  • Firstpage
    807
  • Lastpage
    812
  • Abstract
    Fault tolerance is one of the key issues for large-scale applications executed on high-performance computing systems. In a cluster federation, clusters are gathered to provide huge computing power. To work efficiently on such systems, network characteristics have to be taken into account: the latency between two nodes of different clusters is much higher than the latency between two nodes of the same cluster. In this paper, we present O2P-CF, a message logging protocol well suited to providing fault tolerance for message passing applications executed on cluster federations. O2P-CF is based on the combination of O2P, an extremely optimistic message logging protocol, with a pessimistic message logging protocol.
  • Keywords
    fault tolerant computing; message passing; parallel processing; protocols; workstation clusters; O2P-CF protocol; cluster federations; fault tolerance; high performance computing systems; message passing applications; optimistic message logging protocol; pessimistic message logging protocol; Algorithm design and analysis; Delay; Fault tolerance; Fault tolerant systems; Grid computing; High performance computing; Large-scale systems; Libraries; Message passing; Protocols; Cluster federation; fault tolerance; message logging; message passing application
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster Computing and the Grid, 2008. CCGRID '08. 8th IEEE International Symposium on
  • Conference_Location
    Lyon
  • Print_ISBN
    978-0-7695-3156-4
  • Electronic_ISBN
    978-0-7695-3156-4
  • Type

    conf

  • DOI
    10.1109/CCGRID.2008.76
  • Filename
    4534308