• DocumentCode
    454306
  • Title

    Reachability-based fault-tolerant routing

  • Author

    Montanana, J.M. ; Flich, J. ; Robles, A. ; Duato, J.

  • Author_Institution
    Dept. of Comput. Eng., Univ. Politecnica de Valencia
  • Volume
    1
  • fYear
    0
  • fDate
    0-0 0
  • Abstract
    Clusters of PCs are being used as a cost-effective alternative to large parallel computers. In most of them, it is critical to keep the system running even in the presence of faults. As the number of nodes increases in these systems, the interconnection network grows accordingly. As the number of components grows, the probability of faults increases dramatically; thus, fault tolerance in the system in general, and in the interconnection network in particular, plays a key role. An interesting approach to providing fault tolerance consists of migrating, on the fly, the paths affected by a failure to new fault-free paths. In this paper, we propose a simple and effective fault-tolerant routing methodology, referred to as reachability-based fault-tolerant routing (RFTR), that can be applied to any topology. RFTR builds new alternative paths by joining subpaths extracted from the set of already computed paths, which makes it time-efficient. To avoid deadlocks, RFTR performs, if required, a virtual channel transition at the point where the subpaths are joined. As an example of applicability, in this paper we apply RFTR to InfiniBand. Evaluation results on tori show that RFTR exhibits a low computation cost and does not degrade performance significantly.
  • Keywords
    fault tolerant computing; reachability analysis; telecommunication network routing; workstation clusters; PC clusters; interconnection network; parallel computers; reachability-based fault-tolerant routing; virtual channel transition; Computational efficiency; Concurrent computing; Degradation; Fault tolerance; Fault tolerant systems; Multiprocessor interconnection networks; Personal communication networks; Routing; System recovery; Topology;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Systems, 2006. ICPADS 2006. 12th International Conference on
  • Conference_Location
    Minneapolis, MN
  • ISSN
    1521-9097
  • Print_ISBN
    0-7695-2612-8
  • Type

    conf

  • DOI
    10.1109/ICPADS.2006.89
  • Filename
    1655699