DocumentCode :
21918
Title :
Improving the Reliability of MPI Libraries via Message Flow Checking
Author :
Zhezhe Chen ; Qi Gao ; Wenbin Zhang ; Feng Qin
Author_Institution :
Ohio State Univ., Columbus, OH, USA
Volume :
24
Issue :
3
fYear :
2013
fDate :
March 2013
Firstpage :
535
Lastpage :
549
Abstract :
Despite the success of the Message Passing Interface (MPI), many MPI libraries have suffered from software bugs. These bugs severely impact the productivity of a large number of users, causing program failures or other errors. As a result, MPI application developers often have to spend days or weeks in vain debugging their own code. To address this daunting problem, this paper presents a new method called FlowChecker, which detects communication-related bugs in MPI libraries. First, FlowChecker extracts program intentions of message passing (MP-intentions), which specify messages to be delivered from the sources to the destinations. Then FlowChecker tracks the message flows that actually occur in the underlying MPI libraries. Finally, FlowChecker checks whether the messages are correctly delivered from the sources to the destinations by comparing the message flows against the MP-intentions. If a mismatch is found, FlowChecker reports a bug and provides diagnostic information to help MPI library developers understand and fix it. We have built a FlowChecker prototype on Linux and evaluated it with five real-world and two injected bug cases in three widely used MPI libraries: Open MPI, MPICH2, and MVAPICH2. Our experimental results show that FlowChecker effectively detects all seven evaluated bug cases. Additionally, it provides useful diagnostic information for narrowing down or even pinpointing root causes of the bugs. Moreover, our experiments with High Performance Linpack and NAS Parallel Benchmarks show that FlowChecker induces low runtime overhead (0.9-5.6 percent on Open MPI, 0.9-8.1 percent on MPICH2, and 1.6-9.7 percent on MVAPICH2).
Keywords :
Linux; application program interfaces; message passing; program debugging; software libraries; software reliability; FlowChecker; High Performance Linpack; Linux; MP-intentions; MPI library reliability; MVAPICH2; NAS parallel benchmarks; Open MPI; message flow checking; message passing interface; message passing program intentions; software bugs; Computer bugs; Libraries; Message passing; Runtime; Semantics; Software; Tracking; Software reliability; bug detection; message passing interfaces;
fLanguage :
English
Journal_Title :
Parallel and Distributed Systems, IEEE Transactions on
Publisher :
IEEE
ISSN :
1045-9219
Type :
jour
DOI :
10.1109/TPDS.2012.127
Filename :
6416896