• DocumentCode
    2176200
  • Title

    FT-DRB: A Method for Tolerating Dynamic Faults in High-Speed Interconnection Networks

  • Author

    Zarza, Gonzalo ; Lugones, Diego ; Franco, Daniel ; Luque, Emilio

  • Author_Institution
    Comput. Archit. & Oper. Syst. Dept., Univ. Autonoma of Barcelona, Barcelona, Spain
  • fYear
    2010
  • fDate
    17-19 Feb. 2010
  • Firstpage
    77
  • Lastpage
    84
  • Abstract
    The intensive and continuous use of high-performance computing systems for executing computationally intensive applications, coupled with the large number of elements that make them up, dramatically increases the likelihood of failures during their operation. The interconnection network is a critical part of such systems; therefore, network faults have an extremely high impact because most routing algorithms are not designed to tolerate faults. In such algorithms, a single fault may stall messages in the network, preventing applications from finishing, or may lead to deadlocked configurations. This paper introduces a novel fault-tolerant routing method provided with a new deadlock avoidance technique designed to handle an unbounded number of faults appearing at random during system operation. Our method provides escape paths for the stalled messages. In addition, the routing algorithm configures alternative paths to avoid the faulty areas, taking advantage of communication path redundancy by means of multipath routing approaches. Deadlock avoidance is achieved by adding a small-sized queue and applying a simple set of actions when accessing output buffers with limited free space. Experiments show that our method allows applications to successfully finalize their execution in the presence of several faults, with an average performance value of 96% compared to the fault-free scenarios.
  • Keywords
    distributed processing; fault tolerant computing; system recovery; FT-DRB; deadlock avoidance; dynamic fault tolerance; fault-tolerant routing method; high-performance computing; high-speed interconnection networks; multipath routing; network faults; Algorithm design and analysis; Computer applications; Computer architecture; Fault tolerance; Fault tolerant systems; Multiprocessor interconnection networks; Predictive models; Redundancy; Routing; System recovery; fault tolerance; interconnection networks; multipath routing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel, Distributed and Network-Based Processing (PDP), 2010 18th Euromicro International Conference on
  • Conference_Location
    Pisa
  • ISSN
    1066-6192
  • Print_ISBN
    978-1-4244-5672-7
  • Electronic_ISBN
    1066-6192
  • Type

    conf

  • DOI
    10.1109/PDP.2010.65
  • Filename
    5452508