• DocumentCode
    2194351
  • Title

    A Scalable Method for Signalling Dynamic Reconfiguration Events with OpenSM

  • Author

    Guay, Wei Lin ; Reinemo, Sven-Arne

  • Author_Institution
    Simula Res. Lab., Lysaker, Norway
  • fYear
    2011
  • fDate
    23-26 May 2011
  • Firstpage
    332
  • Lastpage
    341
  • Abstract
    Rerouting around faulty components, on-the-fly policy changes, and migration of jobs all require reconfiguration of data structures in the Queue Pairs residing in the hosts on an InfiniBand cluster. In addition to a proper implementation at the host, the subnet manager needs to implement a scalable method for signalling reconfiguration events to the hosts. In this paper we propose and evaluate three different implementations for signalling dynamic reconfiguration events with OpenSM. Through our evaluation we demonstrate a scalable solution for signalling host-side reconfiguration events in an InfiniBand network, based on an example where dynamic network reconfiguration combined with a topology-agnostic routing function is used to avoid malfunctioning components. Through measurements on our test cluster and an analytical study, we show that our best proposal reduces reconfiguration latency by more than 90% and in certain situations eliminates it completely. Furthermore, the processing overhead in the subnet manager is shown to be minimal.
  • Keywords
    computer network reliability; data structures; fault tolerant computing; telecommunication network routing; InfiniBand cluster; OpenSM; data structures; dynamic network reconfiguration; faulty components; malfunctioning components; on-the-fly policy changes; queue pairs; scalable method; signalling dynamic reconfiguration events; subnet manager; topology-agnostic routing function; Arrays; Fault tolerance; Fault tolerant systems; Network topology; Routing; System recovery; Topology; Automatic path migration; Dynamic reconfiguration; InfiniBand; OpenSM; fault tolerance;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster, Cloud and Grid Computing (CCGrid), 2011 11th IEEE/ACM International Symposium on
  • Conference_Location
    Newport Beach, CA
  • Print_ISBN
    978-1-4577-0129-0
  • Electronic_ISBN
    978-0-7695-4395-6
  • Type

    conf

  • DOI
    10.1109/CCGrid.2011.48
  • Filename
    5948624