  • DocumentCode
    125562
  • Title

    Supporting the Development of Resilient Message Passing Applications Using Simulation

  • Author

    Naughton, Thomas ; Engelmann, Christian ; Vallee, Geoffroy ; Bohm, Swen

  • Author_Institution
    Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA
  • fYear
    2014
  • fDate
    12-14 Feb. 2014
  • Firstpage
    271
  • Lastpage
    278
  • Abstract
    An emerging aspect of high-performance computing (HPC) hardware/software co-design is investigating performance under failure. The work in this paper extends the Extreme-scale Simulator (xSim), which was designed for evaluating the performance of message passing interface (MPI) applications on future HPC architectures, with fault-tolerant MPI extensions proposed by the MPI Fault Tolerance Working Group. xSim permits running MPI applications with millions of concurrent MPI ranks, while observing application performance in a simulated extreme-scale system using a lightweight parallel discrete event simulation. The newly added features offer user-level failure mitigation (ULFM) extensions at the simulated MPI layer to support algorithm-based fault tolerance (ABFT). The presented solution permits investigating performance under failure and failure handling of ABFT solutions. The newly enhanced xSim is the very first performance tool that supports ULFM and ABFT.
  • Keywords
    application program interfaces; concurrency control; discrete event simulation; hardware-software codesign; message passing; parallel architectures; parallel processing; software fault tolerance; software performance evaluation; ABFT solutions; HPC architectures; HPC hardware-software codesign; MPI Fault Tolerance Working Group; MPI applications; ULFM extensions; algorithm-based fault tolerance; concurrent MPI ranks; extreme-scale simulator; fault-tolerant MPI extensions; high-performance computing; lightweight parallel discrete event simulation; message passing interface; performance evaluation; resilient message passing application development; user-level failure mitigation extensions; xSim; Computer architecture; Fault tolerance; Fault tolerant systems; Laboratories; Message passing; Resilience; Runtime; Algorithm-based Fault Tolerance; High-performance Computing; Message Passing Interface; Parallel Discrete Event Simulation; Performance Prediction;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    2014 22nd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)
  • Conference_Location
    Torino
  • ISSN
    1066-6192
  • Type

    conf

  • DOI
    10.1109/PDP.2014.74
  • Filename
    6787286