• DocumentCode
    3244309
  • Title

    Dynamic data replication: an approach to providing fault-tolerant shared memory clusters

  • Author

    Christodoulopoulou, Rosalia ; Azimi, Reza ; Bilas, Angelos

  • Author_Institution
    Dept. of Comput. Sci., Toronto Univ., Ont., Canada
  • fYear
    2003
  • fDate
    8-12 Feb. 2003
  • Firstpage
    203
  • Lastpage
    214
  • Abstract
    A challenging issue in today's server systems is to transparently deal with failures and application-imposed requirements for continuous operation. In this paper we address this problem in shared virtual memory (SVM) clusters at the programming abstraction layer. We design extensions to an existing SVM protocol that has been tuned for low-latency, high-bandwidth interconnects and SMP nodes and we achieve reliability through dynamic replication of application shared data and protocol information. Our extensions allow us to tolerate single (or multiple, but not simultaneous) node failures. We implement our extensions on a state-of-the-art cluster and we evaluate the common, failure-free case. We find that, although the complexity of our protocol is substantially higher than its failure-free counterpart, by taking advantage of architectural features of modern systems our approach imposes low overhead and can be employed for transparently dealing with system failures.
  • Keywords
    computer network reliability; fault tolerant computing; local area networks; performance evaluation; protocols; shared memory systems; SMP nodes; SVM clusters; SVM protocol; dynamic data replication; fault-tolerant shared memory clusters; high-bandwidth interconnects; low-latency interconnects; node failures; programming abstraction layer; reliability; server systems; shared virtual memory clusters; Availability; Buildings; Concurrent computing; Costs; Fault tolerance; Focusing; Monitoring; Operating systems; Protocols; Support vector machines
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings. The Ninth International Symposium on
  • ISSN
    1530-0897
  • Print_ISBN
    0-7695-1871-0
  • Type

    conf

  • DOI
    10.1109/HPCA.2003.1183538
  • Filename
    1183538