• DocumentCode
    1591279
  • Title

    Probabilistic Failure Detection for Efficient Distributed Storage Maintenance

  • Author

    Tian, Jing ; Yang, Zhi ; Chen, Wei ; Zhao, Ben Y. ; Dai, Yafei

  • Author_Institution
    State Key Lab. for Adv. Opt. Commun. Syst. & Networks, Peking Univ., Beijing
  • fYear
    2008
  • Firstpage
    147
  • Lastpage
    156
  • Abstract
    Distributed storage systems often use data replication to mask failures and guarantee high data availability. Node failures can be transient or permanent. While the system must generate new replicas to replace replicas lost to permanent failures, it can save significant replication costs by not replicating after transient failures. Given the unpredictability of network dynamics, however, distinguishing permanent from transient failures is extremely difficult. Traditional timeout approaches are difficult to tune and can introduce unnecessary replication. In this paper, we propose Protector, an algorithm that addresses this problem using network-wide statistical prediction. Our algorithm drastically improves prediction accuracy by making predictions across aggregate replica groups instead of single nodes. These estimates of the number of "live replicas" can guide efficient data replication policies. We prove that, given data on node down times and the probability of permanent failures, the estimate given by our algorithm is more accurate than all alternatives. We describe two ways to obtain the failure probability function, driven by either models or traces. We conduct extensive simulations based on both synthetic and real traces, and show that Protector closely approximates the performance of a perfect "oracle" failure detector, while significantly outperforming timeout-based detectors across a wide range of parameters.
  • Keywords
    data analysis; distributed processing; failure analysis; storage management; Protector; data availability; data replication; distributed storage maintenance; probabilistic failure detection; Aggregates; Availability; Bandwidth; Computer network reliability; Costs; Detectors; Maintenance; Optimized production technology; Peer to peer computing; Protection; Data Recovery; Distributed Storage; Failure Detection
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Reliable Distributed Systems, 2008. SRDS '08. IEEE Symposium on
  • Conference_Location
    Naples
  • ISSN
    1060-9857
  • Print_ISBN
    978-0-7695-3410-7
  • Type

    conf

  • DOI
    10.1109/SRDS.2008.28
  • Filename
    4690809