• DocumentCode
    3103208
  • Title

    MPI/FT™: architecture and taxonomies for fault-tolerant, message-passing middleware for performance-portable parallel computing

  • Author

    Batchu, Rajanikanth ; Neelamegam, Jothi P. ; Cui, Zhenqian ; Beddhu, Murali ; Skjellum, Anthony ; Dandass, Yoginder ; Apte, Manoj

  • Author_Institution
    MPI Software Technol. Inc., Starkville, MS, USA
  • fYear
    2001
  • fDate
    2001
  • Firstpage
    26
  • Lastpage
    33
  • Abstract
    MPI has proven effective for parallel applications in situations with neither QoS nor fault handling. Emerging environments motivate fault-tolerant MPI middleware. Environments include space-based, wide-area/web/meta computing and scalable clusters. MPI/FT, the system described in the paper, trades off sufficient MPI fault coverage against acceptable parallel performance, based on mission requirements and constraints. MPI codes are evolved to use MPI/FT features. Non-portable code for event handlers and recovery management is isolated. User-coordinated recovery, checkpointing, transparency and event handling, as well as evolvability of legacy MPI codes form key design criteria. Parallel self-checking threads address four levels of MPI implementation robustness, three of which are portable to any multithreaded MPI. A taxonomy of application types provides six initial fault-relevant models; user-transparent parallel nMR computation is thereby considered. Key concepts from MPI/RT (real-time MPI) are also incorporated into MPI/FT, with further overt support for MPI/RT and MPI/FT in applications possible in future.
  • Keywords
    client-server systems; message passing; parallel programming; software architecture; software fault tolerance; system recovery; MPI/FT; checkpointing; event handlers; event handling; fault-tolerant middleware; message passing; meta computing; parallel performance; parallel self-checking threads; performance-portable parallel computing; real-time MPI; recovery management; scalable clusters; wide-area network; Checkpointing; Communication standards; Fault tolerance; Fault tolerant systems; Middleware; Operating systems; Process control; Protocols; Quality of service; Taxonomy;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster Computing and the Grid, 2001. Proceedings. First IEEE/ACM International Symposium on
  • Conference_Location
    Brisbane, Qld.
  • Print_ISBN
    0-7695-1010-8
  • Type

    conf

  • DOI
    10.1109/CCGRID.2001.923171
  • Filename
    923171