• DocumentCode
    3047346
  • Title

    A hierarchical checkpointing protocol for parallel applications in cluster federations

  • Author

    Monnet, Sébastien ; Morin, Christine ; Badrinath, Ramamurthy

  • Author_Institution
    IRISA, Rennes, France
  • fYear
    2004
  • fDate
    26-30 April 2004
  • Firstpage
    211
  • Abstract
    Summary form only given. Code coupling applications can be divided into communicating modules that may be executed on different clusters in a cluster federation. Because a cluster federation comprises a large number of nodes, the probability of a node failure is high. We propose a hierarchical checkpointing protocol that combines a synchronized checkpointing technique inside clusters with a communication-induced technique between clusters. This protocol fits the characteristics of a cluster federation (a large number of nodes, and high-latency, low-bandwidth networking technologies between clusters). A preliminary performance evaluation performed using a discrete event simulator shows that the protocol is suitable for code coupling applications.
  • Keywords
    discrete event simulation; parallel processing; performance evaluation; protocols; system recovery; workstation clusters; cluster federations; code coupling applications; discrete event simulator; hierarchical checkpointing protocol; node failure; parallel applications; Application software; Bandwidth; Checkpointing; Delay; Discrete event simulation; ISO standards; Local area networks; Performance evaluation; Protocols; Storage area networks
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International
  • Print_ISBN
    0-7695-2132-0
  • Type

    conf

  • DOI
    10.1109/IPDPS.2004.1303242
  • Filename
    1303242