DocumentCode :
2176200
Title :
FT-DRB: A Method for Tolerating Dynamic Faults in High-Speed Interconnection Networks
Author :
Zarza, Gonzalo ; Lugones, Diego ; Franco, Daniel ; Luque, Emilio
Author_Institution :
Comput. Archit. & Oper. Syst. Dept., Univ. Autonoma of Barcelona, Barcelona, Spain
fYear :
2010
fDate :
17-19 Feb. 2010
Firstpage :
77
Lastpage :
84
Abstract :
The intensive and continuous use of high-performance computing systems for executing computationally intensive applications, coupled with the large number of elements that make them up, dramatically increases the likelihood of failures during their operation. The interconnection network is a critical part of such systems; network faults therefore have an extremely high impact, because most routing algorithms are not designed to tolerate faults. In such algorithms, a single fault may stall messages in the network, preventing applications from completing, or may lead to deadlocked configurations. This paper introduces a novel fault-tolerant routing method with a new deadlock avoidance technique designed to tolerate an unbounded number of faults appearing at random during system operation. Our method provides escape paths for stalled messages. In addition, the routing algorithm configures alternative paths that avoid the faulty areas, exploiting communication path redundancy by means of multipath routing. Deadlock avoidance is achieved by adding a small queue and applying a simple set of actions when accessing output buffers with limited free space. Experiments show that our method allows applications to complete their execution in the presence of several faults, with an average performance of 96% relative to the fault-free scenarios.
Keywords :
distributed processing; fault tolerant computing; system recovery; FT-DRB; deadlock avoidance; dynamic fault tolerance; fault-tolerant routing method; high-performance computing; high-speed interconnection networks; multipath routing; network faults; Algorithm design and analysis; Computer applications; Computer architecture; Fault tolerance; Fault tolerant systems; Multiprocessor interconnection networks; Predictive models; Redundancy; Routing; System recovery; fault tolerance; interconnection networks; multipath routing;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel, Distributed and Network-Based Processing (PDP), 2010 18th Euromicro International Conference on
Conference_Location :
Pisa
ISSN :
1066-6192
Print_ISBN :
978-1-4244-5672-7
Electronic_ISBN :
1066-6192
Type :
conf
DOI :
10.1109/PDP.2010.65
Filename :
5452508