Title :
Application Level Fault Recovery: Using Fault-Tolerant Open MPI in a PDE Solver
Author :
Ali, Md Mortuza ; Southern, James ; Strazdins, Peter ; Harding, Brendan
Author_Institution :
Res. Sch. of Comput. Sci., Australian Nat. Univ., Canberra, ACT, Australia
Abstract :
A fault-tolerant version of Open Message Passing Interface (Open MPI), based on the draft User Level Failure Mitigation (ULFM) proposal of the MPI Forum's Fault Tolerance Working Group, is used to create fault-tolerant applications. This allows applications and libraries to design their own recovery methods and control them at the user level. However, only a limited amount of research on user-level failure recovery (including the implementation and performance evaluation of this prototype) has been carried out. This paper contributes a fault-tolerant implementation of an application that solves 2D partial differential equations (PDEs) by means of a sparse grid combination technique and that is capable of surviving multiple process failures caused by faults. Our fault recovery involves reconstructing the faulty communicators without shrinking the global size, by re-spawning failed MPI processes on the same physical processors where they ran before the failure (for load balancing). It also involves restoring lost data from either exact checkpointed data on disk, approximated data in memory (via an alternate sparse grid combination technique) or a near-exact copy of replicated data in memory. The experimental results show that the faulty communicator reconstruction time is currently large in the draft ULFM, especially for multiple process failures. They also show that the alternate combination technique has the lowest data recovery overhead, except on a system with very low disk write latency, for which checkpointing has the lowest overhead. Furthermore, the errors due to the recovery of approximated data are within a factor of 10 in all cases, with the surprising result that the alternate combination technique is more accurate than the near-exact replication method.
The implementation details contributed in this paper, together with the analysis of the experimental results, will help application developers resolve design and implementation issues in fault-tolerant applications built by means of the Open MPI ULFM standard.
Keywords :
application program interfaces; fault tolerant computing; message passing; open systems; partial differential equations; resource allocation; system recovery; 2D partial differential equations; MPI forum; MPI processes; Open MPI ULFM standard; PDE solver; ULFM proposal; alternate sparse grid combination technique; application level fault recovery; approximated data recovery; checkpointing; data recovery overhead; draft ULFM; fault tolerance working group; fault-tolerant applications; fault-tolerant implementation; fault-tolerant open MPI; fault-tolerant version; faulty communicator reconstruction time; load balancing; near-exact copy; near-exact replication method; open message passing interface; replicated data; user level failure mitigation; user level failure recovery; very low disk write latency; Approximation methods; Educational institutions; Fault tolerance; Fault tolerant systems; Libraries; Standards; Synchronization; PDE solver; ULFM; approximation error; fault tolerance; process failure recovery; sparse grid combination;
Conference_Title :
Parallel & Distributed Processing Symposium Workshops (IPDPSW), 2014 IEEE International
Conference_Location :
Phoenix, AZ
Print_ISBN :
978-1-4799-4117-9
DOI :
10.1109/IPDPSW.2014.132