Title :
Architecture of LA-MPI, a network-fault-tolerant MPI
Author :
Aulwes, Rob T. ; Daniel, David J. ; Desai, Nehal N. ; Graham, Richard L. ; Risinger, L. Dean ; Taylor, Mark A. ; Woodall, Timothy S. ; Sukalski, Mitchel W.
Author_Institution :
Adv. Comput. Lab., Los Alamos Nat. Lab., NM, USA
Abstract :
Summary form only given. We discuss the unique architectural elements of the Los Alamos Message Passing Interface (LA-MPI), a high-performance, network-fault-tolerant, thread-safe MPI library. LA-MPI is designed for use on terascale clusters, which are inherently unreliable because of their sheer number of system components and the trade-offs made between cost and performance. We examine in detail the design concepts used to implement LA-MPI. These include reliability features such as application-level checksumming, message retransmission, and automatic message rerouting. Other key performance-enhancing features, such as concurrent message routing over multiple, diverse network adapters and protocols, and communication-specific optimizations (e.g., shared memory), are also examined.
Keywords :
fault tolerant computing; message passing; open systems; workstation clusters; Los Alamos message passing interface; application-level checksumming; automatic message rerouting; concurrent message routing; message retransmission; network adapters; network protocols; network-fault-tolerant MPI; reliability; shared memory system; terascale clusters; Application software; Computer architecture; Fault tolerance; Fault tolerant systems; High performance computing; Laboratories; Libraries; Message passing; Protocols; Telecommunication network reliability
Conference_Title :
Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International
Print_ISBN :
0-7695-2132-0
DOI :
10.1109/IPDPS.2004.1302920