• DocumentCode
    725335
  • Title

    Rewiring 2 Links Is Enough: Accelerating Failure Recovery in Production Data Center Networks

  • Author

    Guo Chen ; Youjian Zhao ; Dan Pei ; Dan Li

  • Author_Institution
    Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China
  • fYear
    2015
  • fDate
    June 29 2015-July 2 2015
  • Firstpage
    569
  • Lastpage
    578
  • Abstract
    Failures are not uncommon in production data center networks (DCNs) nowadays, and it takes a long time for the network to recover from a failure and find new forwarding paths, significantly impacting real-time and interactive applications at the upper layer. The slow failure recovery has two primary causes. First, downward links in DCNs with a multi-rooted tree topology lack immediate backup paths. Second, distributed routing protocols in DCNs take time to converge after failures. In this paper, we present a fault-tolerant DCN solution, called F2Tree, that can significantly shorten the failure recovery time in current DCNs through only a small amount of link rewiring and switch configuration changes. Because F2Tree does not change any existing software or hardware, it is readily deployable in production DCNs, which other existing proposals fail to achieve. Through testbed and emulation experiments, we show that F2Tree reduces failure recovery time by 78%. Our experimental results also show that, for partition-aggregate applications (popular in DCNs) under various failure conditions, F2Tree reduces the ratio of deadline-missing requests by more than 96% compared to current DCNs.
  • Keywords
    computer centres; failure analysis; fault tolerant computing; real-time systems; F2Tree; backup paths; downward links; failure recovery acceleration; fault-tolerant DCN solution; forwarding paths; interactive applications; multirooted tree topology; partition-aggregate applications; production data center networks; realtime applications; Ports (Computers); Production; Redundancy; Routing; Routing protocols; Switches; Topology; Data center networks; Failure recovery
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Distributed Computing Systems (ICDCS), 2015 IEEE 35th International Conference on
  • Conference_Location
    Columbus, OH
  • ISSN
    1063-6927
  • Type

    conf

  • DOI
    10.1109/ICDCS.2015.64
  • Filename
    7164942