• DocumentCode
    451251
  • Title
    MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging
  • Author
    Bouteiller, Aurélien ; Cappello, Franck ; Hérault, Thomas ; Krawezik, Géraud ; Lemarinier, Pierre ; Magniette, Frédéric
  • Author_Institution
    LRI, Université de Paris Sud, Orsay, France
  • fYear
    2003
  • fDate
    15-21 Nov. 2003
  • Firstpage
    25
  • Lastpage
    25
  • Abstract
    Execution of MPI applications on clusters and Grid deployments suffering from node and network failures motivates the use of fault-tolerant MPI implementations. We present MPICH-V2 (the second protocol of the MPICH-V project), an automatic fault-tolerant MPI implementation using an innovative protocol that removes the most limiting factor of the pessimistic message logging approach: reliable logging of in-transit messages. MPICH-V2 relies on uncoordinated checkpointing, sender-based message logging, and remote reliable logging of message logical clocks. This paper presents the architecture of MPICH-V2, its theoretical foundation, and the performance of the implementation. We compare MPICH-V2 to MPICH-V1 and MPICH-P4, evaluating a) its point-to-point performance, b) its performance on the NAS benchmarks, and c) application performance when many faults occur during execution. Experimental results demonstrate that MPICH-V2 provides performance close to MPICH-P4 for applications using large messages while dramatically reducing the number of reliable nodes compared to MPICH-V1.
  • Keywords
    Checkpointing; Clocks; Costs; Fault tolerance; High performance computing; Message passing; Permission; Programming profession; Protocols; Uniform resource locators;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Supercomputing, 2003 ACM/IEEE Conference
  • Print_ISBN
    1-58113-695-1
  • Type
    conf
  • DOI
    10.1109/SC.2003.10027
  • Filename
    1592928