Title :
Designing Network Failover and Recovery in MPI for Multi-Rail InfiniBand Clusters
Author :
Raikar, S. Pai ; Subramoni, H. ; Kandalla, K. ; Vienne, J. ; Panda, D.K.
Author_Institution :
Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA
Abstract :
The emerging trend of designing commodity-based supercomputing systems has a severe detrimental impact on the Mean Time Between Failures (MTBF). The MTBF for typical HEC installations is currently estimated to be between eight hours and fifteen days [1]. Failures in the interconnect fabric account for a fair share of the total failures occurring in such systems, and this share will only grow as system sizes increase. Thus, it is highly desirable that next-generation system architectures and software environments provide sophisticated network-level fault-tolerance and fault-resilience solutions. In the past few years, the number of cores per processor has increased dramatically. To make efficient use of these machines, it is necessary to provide the required bandwidth to all the cores. To keep up with the multi-core trend, current-generation supercomputers and clusters are designed with multiple network cards (rails) to provide enhanced data transfer capabilities. Besides enhancing performance, such multi-rail networks can also be leveraged to provide network-level fault resilience. This paper presents a design for a failover mechanism in a multi-rail scenario that handles network failures and their recovery without compromising performance. In a typical message passing scenario, whenever a network failure occurs, the entire job aborts. Our design allows the job to continue even when a network failure occurs, by using the remaining rails for communication. Once a rail recovers from the failure, we also propose a protocol to re-establish connections on that rail and resume normal operation. We experimentally demonstrate that our implementation adds very little overhead and delivers performance comparable to that of the other rails running in isolation. We also show that recovery is immediate and incurs no additional overhead.
We also demonstrate the robustness and reliability of the design by running application benchmarks with permanent failures.
Keywords :
fault tolerant computing; message passing; multiprocessing systems; parallel machines; HEC installation; MPI; MTBF; commodity-based supercomputing system; data transfer capability; failover mechanism; fault-resilient solution; interconnect fabric failure; mean-time-between-failures; message passing; multicore trend; multirail InfiniBand clusters; multirail network; multirail scenario; network card; network failover; network failure; network level fault resilience; network level fault-tolerance; network recovery; next generation system architecture; protocol; software environment; supercomputer; Benchmark testing; Fault tolerance; Fault tolerant systems; Libraries; Peer to peer computing; Rails; Servers; Failover; Fault Tolerance; MPI;
Conference_Titel :
Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International
Conference_Location :
Shanghai
Print_ISBN :
978-1-4673-0974-5
DOI :
10.1109/IPDPSW.2012.142