  • DocumentCode
    3596085
  • Title
    Distributed wait state tracking for runtime MPI deadlock detection
  • Author
    Hilbrich, Tobias ; de Supinski, Bronis R. ; Nagel, Wolfgang E. ; Protze, Joachim ; Baier, Christine ; Müller, Matthias S.
  • Author_Institution
    Tech. Univ. Dresden, Dresden, Germany
  • fYear
    2013
  • Firstpage
    1
  • Lastpage
    12
  • Abstract
    The widely used Message Passing Interface (MPI) with its multitude of communication functions is prone to usage errors. Runtime error detection tools aid in the removal of these errors. We develop MUST as one such tool that provides a wide variety of automatic correctness checks. Its correctness checks can be run in a distributed mode, except for its deadlock detection. This limitation applies to a wide range of tools that either use centralized detection algorithms or a timeout approach. In order to provide scalable and distributed deadlock detection with detailed insight into deadlock situations, we propose a model for MPI blocking conditions that we use to formulate a distributed algorithm. This algorithm implements scalable MPI deadlock detection in MUST. Stress tests at up to 4,096 processes demonstrate the scalability of our approach. Finally, overhead results for a complex benchmark suite demonstrate an average runtime increase of 34% at 2,048 processes.
  • Keywords
    application program interfaces; benchmark testing; concurrency control; distributed algorithms; error detection; message passing; MPI blocking conditions; MUST; centralized detection algorithms; communication functions; complex benchmark suite; correctness checks; deadlock situations; distributed algorithm; distributed deadlock detection; distributed wait state tracking; message passing interface; runtime MPI deadlock detection; runtime error detection tools; scalable MPI deadlock detection; timeout approach; Algorithm design and analysis; Distributed algorithms; Runtime; Scalability; Semantics; Standards; System recovery;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Computing, Networking, Storage and Analysis (SC), 2013 International Conference for
  • Print_ISBN
    978-1-4503-2378-9
  • Type
    conf
  • DOI
    10.1145/2503210.2503237
  • Filename
    6877449