• DocumentCode
    2999609
  • Title

    Designing Network Failover and Recovery in MPI for Multi-Rail InfiniBand Clusters

  • Author

    Raikar, S. Pai ; Subramoni, H. ; Kandalla, K. ; Vienne, J. ; Panda, D.K.

  • Author_Institution
    Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA
  • fYear
    2012
  • fDate
    21-25 May 2012
  • Firstpage
    1160
  • Lastpage
    1167
  • Abstract
    The emerging trends of designing commodity-based supercomputing systems have a severe detrimental impact on the Mean-Time-Between-Failures (MTBF). The MTBF for typical HEC installations is currently estimated to be between eight hours and fifteen days [1]. Failures in the interconnect fabric account for a fair share of the total failures occurring in such systems. This will continue to degrade as system sizes become larger. Thus, it is highly desirable that next generation system architectures and software environments provide sophisticated network level fault-tolerance and fault-resilient solutions. In the past few years, the number of cores on processors has increased dramatically. To make efficient use of these machines it is necessary to provide the required bandwidth to all the cores. To keep up with the multi-core trend, current generation supercomputers and clusters are designed with multiple network cards (rails) to provide enhanced data transfer capabilities. Besides providing enhanced performance, such multi-rail networks can also be leveraged to provide network level fault resilience. This paper presents a design for a failover mechanism in a multi-rail scenario, for handling network failures and their recovery without compromising on performance. In a general message passing scenario, whenever there is a network failure, the entire job aborts. Our design allows the job to continue even when a network failure occurs, by using the remaining rails for communication. Once the rail recovers from the failure, we also propose a protocol to re-establish connections on that rail and resume normal operations. We experimentally demonstrate that our implementation adds very little overhead and is able to deliver good performance which is comparable to that of the other rails running in isolation. We also show that the recovery is immediate and is associated with no additional overhead. We also depict sustenance and reliability of the design by running application benchmarks with permanent failures.
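    The failover idea summarized in the abstract (keep communicating over the remaining rails when one fails, then resume use of the rail once it recovers) can be illustrated with a minimal sketch. This is a toy model and not the paper's implementation: the class `MultiRailScheduler`, its methods, and the round-robin policy are all hypothetical names and assumptions introduced here for illustration, and no real MPI or InfiniBand API is used.

```python
class MultiRailScheduler:
    """Toy model of rail selection with failover (hypothetical, illustrative only)."""

    def __init__(self, num_rails):
        if num_rails < 1:
            raise ValueError("need at least one rail")
        # True means the rail is currently usable for communication.
        self.active = [True] * num_rails
        # Index at which the next round-robin search starts.
        self._next = 0

    def mark_failed(self, rail):
        # Assumed to be called when a failure is detected on this rail.
        self.active[rail] = False

    def mark_recovered(self, rail):
        # Assumed to be called after connections on this rail are re-established.
        self.active[rail] = True

    def pick_rail(self):
        # Round-robin over the rails, skipping any that are marked as failed.
        n = len(self.active)
        for offset in range(n):
            rail = (self._next + offset) % n
            if self.active[rail]:
                self._next = (rail + 1) % n
                return rail
        # With no rail left, the job cannot continue communicating.
        raise RuntimeError("all rails are down")


scheduler = MultiRailScheduler(2)
before_failure = [scheduler.pick_rail() for _ in range(3)]

scheduler.mark_failed(1)
during_failure = [scheduler.pick_rail() for _ in range(2)]

scheduler.mark_recovered(1)
after_recovery = [scheduler.pick_rail() for _ in range(2)]
```

    In this sketch, traffic alternates between the two rails, falls back to rail 0 alone while rail 1 is marked as failed, and alternates again once rail 1 is marked as recovered. How failures are detected and how connections are re-established are outside the scope of the sketch.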
  • Keywords
    fault tolerant computing; message passing; multiprocessing systems; parallel machines; HEC installation; MPI; MTBF; commodity-based supercomputing system; data transfer capability; failover mechanism; fault-resilient solution; interconnect fabric failure; mean-time-between-failures; message passing; multicore trend; multirail InfiniBand clusters; multirail network; multirail scenario; network card; network failover; network failure; network level fault resilience; network level fault-tolerance; network recovery; next generation system architecture; protocol; software environment; supercomputer; Benchmark testing; Fault tolerance; Fault tolerant systems; Libraries; Peer to peer computing; Rails; Servers; Failover; Fault Tolerance; MPI;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International
  • Conference_Location
    Shanghai
  • Print_ISBN
    978-1-4673-0974-5
  • Type

    conf

  • DOI
    10.1109/IPDPSW.2012.142
  • Filename
    6270768