DocumentCode
2176200
Title
FT-DRB: A Method for Tolerating Dynamic Faults in High-Speed Interconnection Networks
Author
Zarza, Gonzalo ; Lugones, Diego ; Franco, Daniel ; Luque, Emilio
Author_Institution
Comput. Archit. & Oper. Syst. Dept., Univ. Autonoma of Barcelona, Barcelona, Spain
fYear
2010
fDate
17-19 Feb. 2010
Firstpage
77
Lastpage
84
Abstract
The intensive and continuous use of high-performance computing systems for executing computationally intensive applications, coupled with the large number of elements that make them up, dramatically increases the likelihood of failures during their operation. The interconnection network is a critical part of such systems; network faults therefore have an extremely high impact, because most routing algorithms are not designed to tolerate faults. In such algorithms, a single fault may stall messages in the network, preventing applications from completing, or may lead to deadlocked configurations. This paper introduces a novel fault-tolerant routing method with a new deadlock avoidance technique, designed to tolerate an unbounded number of faults appearing at random during system operation. Our method provides escape paths for the stalled messages. In addition, the routing algorithm configures alternative paths to avoid the faulty areas, taking advantage of communication path redundancy by means of multipath routing approaches. Deadlock avoidance is achieved by adding a small-sized queue and applying a simple set of actions when accessing output buffers with limited free space. Experiments show that our method allows applications to successfully complete their execution in the presence of several faults, with an average performance of 96% relative to the fault-free scenarios.
Keywords
distributed processing; fault tolerant computing; system recovery; FT-DRB; deadlock avoidance; dynamic fault tolerance; fault-tolerant routing method; high-performance computing; high-speed interconnection networks; multipath routing; network faults; Algorithm design and analysis; Computer applications; Computer architecture; Fault tolerance; Fault tolerant systems; Multiprocessor interconnection networks; Predictive models; Redundancy; Routing; System recovery; fault tolerance; interconnection networks; multipath routing;
fLanguage
English
Publisher
IEEE
Conference_Titel
Parallel, Distributed and Network-Based Processing (PDP), 2010 18th Euromicro International Conference on
Conference_Location
Pisa
ISSN
1066-6192
Print_ISBN
978-1-4244-5672-7
Electronic_ISBN
1066-6192
Type
conf
DOI
10.1109/PDP.2010.65
Filename
5452508