• DocumentCode
    2364006
  • Title
    Tolerating network failures in system area networks
  • Author
    Tang, Jeffrey ; Bilas, Angelos
  • Author_Institution
    Dept. of Comput. Sci., Toronto Univ., Ont., Canada
  • fYear
    2002
  • fDate
    2002
  • Firstpage
    121
  • Lastpage
    130
  • Abstract
    In this paper, we investigate how system area networks can deal with transient and permanent network failures. We design and implement a firmware-level retransmission scheme to tolerate transient failures and an on-demand network mapping scheme to deal with permanent failures. Both schemes are transparent to applications, conceptually simple, and suitable for low-level implementations, e.g. in firmware. We then examine how the retransmission scheme affects system performance and how various protocol parameters impact system behavior. We analyze and evaluate system performance using a real implementation on a state-of-the-art cluster, with both micro-benchmarks and real applications from the SPLASH-2 suite.
  • Keywords
    computer network reliability; fault tolerant computing; firmware; local area networks; performance evaluation; SPLASH-2 suite; cluster; firmware-level retransmission scheme; low-level implementations; microbenchmarks; on-demand network mapping scheme; permanent network failure tolerance; protocol parameters; system area networks; system performance; transient network failure tolerance; Access protocols; Computer science; Error analysis; Intelligent networks; Microprogramming; Multiprocessor interconnection networks; Network interfaces; Performance analysis; Switches; System performance;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel Processing, 2002. Proceedings. International Conference on
  • ISSN
    0190-3918
  • Print_ISBN
    0-7695-1677-7
  • Type
    conf
  • DOI
    10.1109/ICPP.2002.1040866
  • Filename
    1040866