DocumentCode
3103208
Title
MPI/FT™: architecture and taxonomies for fault-tolerant, message-passing middleware for performance-portable parallel computing
Author
Batchu, Rajanikanth ; Neelamegam, Jothi P. ; Cui, Zhenqian ; Beddhu, Murali ; Skjellum, Anthony ; Dandass, Yoginder ; Apte, Manoj
Author_Institution
MPI Software Technol. Inc., Starkville, MS, USA
fYear
2001
fDate
2001
Firstpage
26
Lastpage
33
Abstract
MPI has proven effective for parallel applications in situations with neither QoS nor fault handling. Emerging environments motivate fault-tolerant MPI middleware. These environments include space-based, wide-area/web/meta computing and scalable clusters. MPI/FT, the system described in the paper, trades off sufficient MPI fault coverage against acceptable parallel performance, based on mission requirements and constraints. MPI codes are evolved to use MPI/FT features. Non-portable code for event handlers and recovery management is isolated. User-coordinated recovery, checkpointing, transparency and event handling, as well as evolvability of legacy MPI codes, form key design criteria. Parallel self-checking threads address four levels of MPI implementation robustness, three of which are portable to any multithreaded MPI. A taxonomy of application types provides six initial fault-relevant models; user-transparent parallel nMR computation is thereby considered. Key concepts from MPI/RT (real-time MPI) are also incorporated into MPI/FT, with further overt support for MPI/RT and MPI/FT in applications possible in the future.
Keywords
client-server systems; message passing; parallel programming; software architecture; software fault tolerance; system recovery; MPI/FT; checkpointing; event handlers; event handling; fault-tolerant middleware; meta computing; parallel performance; parallel self-checking threads; performance-portable parallel computing; real-time MPI; recovery management; scalable clusters; wide-area network; Checkpointing; Communication standards; Fault tolerance; Fault tolerant systems; Middleware; Operating systems; Process control; Protocols; Quality of service; Taxonomy
fLanguage
English
Publisher
IEEE
Conference_Titel
Cluster Computing and the Grid, 2001. Proceedings. First IEEE/ACM International Symposium on
Conference_Location
Brisbane, Qld.
Print_ISBN
0-7695-1010-8
Type
conf
DOI
10.1109/CCGRID.2001.923171
Filename
923171