• DocumentCode
    3321577
  • Title

    Fault-tolerant replication management in large-scale distributed storage systems

  • Author

    Golding, Richard ; Borowsky, Elizabeth

  • Author_Institution
    Storage Syst. Program, Hewlett-Packard Labs., USA
  • fYear
    1999
  • fDate
    1999
  • Firstpage
    144
  • Lastpage
    155
  • Abstract
    Failures of all forms happen: from losing single network packets to site-wide disasters. Since businesses rely heavily on their data, it is imperative that failures require minimal time and effort to repair and that the service interruption during the failure or repair period be as short as possible. To this end, the ideal system should repair itself, relying on humans only when absolutely necessary in the repair process. This paper describes one component of a self-healing storage system: the component that allows for automatic recovery of access to data when the power comes back on after a large-scale outage. Our failure recovery protocol is part of a suite of modular protocols that make up the Palladio distributed storage system. This protocol guarantees that service will be repaired quickly and automatically when enough failures are repaired.
  • Keywords
    distributed processing; fault tolerant computing; memory protocols; system recovery; Palladio distributed storage systems; automatic failure recovery; large-scale systems; modular protocols; protocol; self-repairing storage system; Access protocols; Earthquakes; Educational institutions; Fault detection; Fault tolerant systems; Hardware; Large-scale systems; Read only memory; Storage automation; Storms
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Reliable Distributed Systems, 1999. Proceedings of the 18th IEEE Symposium on
  • Conference_Location
    Lausanne
  • ISSN
    1060-9857
  • Print_ISBN
    0-7695-0290-3
  • Type

    conf

  • DOI
    10.1109/RELDIS.1999.805091
  • Filename
    805091