  • DocumentCode
    125514
  • Title
    Multi-homed Fat-Tree Routing with InfiniBand
  • Author
    Bogdanski, Bartosz; Johnsen, Bjorn Dag; Reinemo, Sven-Arne
  • Author_Institution
    Oracle Corp., Oslo, Norway
  • fYear
    2014
  • fDate
    12-14 Feb. 2014
  • Firstpage
    122
  • Lastpage
    129
  • Abstract
    For clusters whose topology consists of a fat-tree, or of several fat-trees combined into one subnet, the routing algorithm should support several properties beyond what exists today. One missing property is that the current fat-tree routing algorithm does not guarantee that each port on a multi-homed node is routed through redundant spines, even if these ports are connected to redundant leaves. As a consequence, when a spine fails, there is a small window during which the node is unreachable until the subnet manager has rerouted traffic through another spine. In this paper, we discuss the need for independent routes for multi-homed nodes in fat-trees, giving real-life examples in which a single point of failure leads to a complete outage of a multi-port node. We present and implement methods that may be used to alleviate this problem and perform simulations that demonstrate improvements in the performance, scalability, availability and predictability of InfiniBand fat-tree topologies. We show that our methods not only increase performance by up to 52.6%, but also, and more importantly, that there is no downtime associated with a spine switch failure.
  • Keywords
    computer network performance evaluation; system recovery; telecommunication network routing; telecommunication network topology; telecommunication switching; trees (mathematics); workstation clusters; InfiniBand; availability improvement; multihomed fat-tree routing algorithm; multihomed node; multiport node; performance improvement; predictability improvement; redundant leaves; redundant spines; scalability improvement; spine switch failure; Algorithm design and analysis; Fabrics; Ports (Computers); Proposals; Routing; System recovery; Topology; fat-tree; mFtree; multihoming; routing; two-port nodes;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Parallel, Distributed and Network-Based Processing (PDP), 2014 22nd Euromicro International Conference on
  • Conference_Location
    Torino
  • ISSN
    1066-6192
  • Type
    conf
  • DOI
    10.1109/PDP.2014.22
  • Filename
    6787262