• DocumentCode
    2978499
  • Title

    Discovery and Routing of Degraded Fat-Trees

  • Author

    Bogdanski, Bartosz ; Johnsen, Bjorn Dag ; Reinemo, Sven-Arne ; Sem-Jacobsen, Frank Olaf

  • Author_Institution
    Oracle Corp., Oslo, Norway
  • fYear
    2012
  • fDate
    14-16 Dec. 2012
  • Firstpage
    697
  • Lastpage
    702
  • Abstract
    The fat-tree topology has become a popular choice for InfiniBand enterprise systems due to its deadlock freedom, fault-tolerance and full bisection bandwidth. In the HPC domain, InfiniBand fabric is used in almost 42% of the systems on the latest Top 500 list, and many of those systems are based on the fat-tree topology. Despite the popularity of the fat-tree topology, little research has been done to compare the behavior of InfiniBand routing algorithms on degraded fat-tree topologies. In this paper, we identify the weaknesses of the current fat-tree routing and propose enhancements that liberalize the restrictions imposed on the routed fabric. Furthermore, we present a thorough analysis of non-proprietary routing algorithms that are implemented in the InfiniBand Open Subnet Manager. Our results show that even though the performance of a fat-tree routed network deteriorates predictably with the number of failed links, the fat-tree routing algorithm is still the best choice for severely degraded fat-tree fabrics.
  • Keywords
    computer network performance evaluation; fault tolerance; field buses; telecommunication links; telecommunication network routing; telecommunication network topology; HPC domain; InfiniBand enterprise systems; InfiniBand fabric; InfiniBand open subnet manager; InfiniBand routing algorithms; bisection bandwidth; deadlock freedom; degraded fat-tree discovery; degraded fat-tree fabrics; degraded fat-tree routing; fat-tree routed network performance; fat-tree topology; fault-tolerance; link failure; nonproprietary routing algorithms; Fabrics; Network topology; Ports (Computers); Routing; Switches; System recovery; Topology; InfiniBand; fat-tree; fault-tolerance; routing algorithms;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Computing, Applications and Technologies (PDCAT), 2012 13th International Conference on
  • Conference_Location
    Beijing
  • Print_ISBN
    978-0-7695-4879-1
  • Type

    conf

  • DOI
    10.1109/PDCAT.2012.67
  • Filename
    6589362