DocumentCode
1478367
Title
Dynamic Fault Tolerance in Fat Trees
Author
Sem-Jacobsen, Frank Olaf ; Skeie, Tor ; Lysne, Olav ; Duato, José
Author_Institution
Simula Res. Lab., Lysaker, Norway
Volume
60
Issue
4
fYear
2011
fDate
4/1/2011
Firstpage
508
Lastpage
525
Abstract
Fat trees are a very common communication architecture in current large-scale parallel computers. The probability of failure in these systems increases with the number of components. We present a routing method for deterministically and adaptively routed fat trees, applicable to both distributed and source routing, that handles several concurrent faults and transparently returns to the original routing strategy once the faulty components have recovered. The method is local and dynamic, completely masking the fault from the rest of the system. It requires only a small amount of extra functionality in the switches to reroute packets around a fault. The method guarantees connectedness, deadlock freedom, and livelock freedom for up to k-1 benign simultaneous switch and/or link faults, where k is half the number of ports in the switches. Our simulation experiments show a graceful degradation of performance as more faults occur. Furthermore, we demonstrate that for most fault combinations, our method can, with high probability, handle significantly more faults than the k-1 limit.
Keywords
failure analysis; fault tolerant computing; large-scale systems; network routing; parallel architectures; parallel machines; trees (mathematics); communication architecture; concurrent fault; distributed routing; dynamic fault tolerance; failure probability; fat tree; large-scale parallel computer; rerouting packet handling; routing method; source routing; Fault tolerance; Fault tolerant systems; Heuristic algorithms; Network topology; Routing; System recovery; Topology; Fat trees; adaptive routing; deterministic routing; dynamic fault tolerance; k-ary n-trees
fLanguage
English
Journal_Title
Computers, IEEE Transactions on
Publisher
IEEE
ISSN
0018-9340
Type
jour
DOI
10.1109/TC.2010.97
Filename
5453356