• DocumentCode
    3200143
  • Title
    Efficient Process Replication for MPI Applications: Sharing Work between Replicas

  • Author
    Ropars, Thomas ; Lefray, Arnaud ; Dohyun Kim ; Schiper, Andre

  • Author_Institution
    Ecole Polytech. Fed. de Lausanne (EPFL), Lausanne, Switzerland
  • fYear
    2015
  • fDate
    25-29 May 2015
  • Firstpage
    645
  • Lastpage
    654
  • Abstract
    With the increased failure rate expected in future extreme-scale supercomputers, process replication might become a viable alternative to checkpointing. By default, the workload efficiency of replication is limited to 50% because of the additional resources that have to be used to execute the replicas of the application's processes. In this paper, we introduce intra-parallelization, a solution that avoids replicating all computation by introducing work-sharing between replicas. We show on a representative set of benchmarks that intra-parallelization achieves more than 50% efficiency without compromising fault tolerance.
  • Keywords
    application program interfaces; checkpointing; fault tolerant computing; message passing; parallel processing; MPI applications; checkpointing; extreme scale supercomputers; failure rate; intraparallelization; process replication; Checkpointing; Computer crashes; Context; Fault tolerance; Fault tolerant systems; Kernel; Protocols; High performance computing; fault tolerance; replication;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International
  • Conference_Location
    Hyderabad
  • ISSN
    1530-2075
  • Type
    conf
  • DOI
    10.1109/IPDPS.2015.29
  • Filename
    7161552