  • DocumentCode
    2736599
  • Title

    Fault-tolerant switched local area networks

  • Author

    LeMahieu, Paul ; Bohossian, Vasken ; Bruck, Jehoshua

  • Author_Institution
    California Inst. of Technol., Pasadena, CA, USA
  • fYear
    1998
  • fDate
    30 Mar-3 Apr 1998
  • Firstpage
    747
  • Lastpage
    751
  • Abstract
    The RAIN (Reliable Array of Independent Nodes) project at Caltech is focusing on creating highly reliable distributed systems by leveraging commercially available personal computers, workstations and interconnect technologies. In particular, the issue of reliable communication is addressed by introducing redundancy in the form of multiple network interfaces per compute node. When using compute nodes with multiple network connections, the question arises of how best to connect these nodes to a given network of switches. We examine networks of switches (e.g., based on Myrinet technology) and focus on degree-two compute nodes (two network adaptor cards per node). Our primary goal is to create networks that are as resistant as possible to partitioning. Our main contributions are: (i) a construction for degree-2 compute nodes connected by a ring network of switches of degree 4 that can tolerate any 3 switch failures without partitioning the nodes into disjoint sets; (ii) a proof that this construction is optimal in the sense that no construction can tolerate more switch failures while avoiding partitioning; and (iii) generalizations of this construction to arbitrary switch and node degrees and to other switch networks, in particular to a fully-connected network of switches.
  • Keywords
    computer network reliability; graph theory; local area networks; network interfaces; switching networks; Myrinet; RAIN; Reliable Array of Independent Nodes; degree-two compute nodes; fault-tolerant switched local area networks; multiple network interfaces; node partitioning; personal computers; redundancy; reliable communication; reliable distributed systems; ring network; switch failures; switch networks; workstations; Communication switching; Computer network reliability; Computer networks; Fault tolerance; Local area networks; Microcomputers; Rain; Switches; Telecommunication network reliability; Workstations;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel Processing Symposium, 1998. IPPS/SPDP 1998. Proceedings of the First Merged International ... and Symposium on Parallel and Distributed Processing 1998
  • Conference_Location
    Orlando, FL
  • ISSN
    1063-7133
  • Print_ISBN
    0-8186-8404-6
  • Type

    conf

  • DOI
    10.1109/IPPS.1998.670011
  • Filename
    670011