• DocumentCode
    1761538
  • Title

    Self-Adapting Reliability in Distributed Software Systems

  • Author

    Brun, Yuriy ; Bang, Jae Young ; Edwards, George ; Medvidovic, Nenad

  • Author_Institution
    Sch. of Comput. Sci., Univ. of Massachusetts, Amherst, MA, USA
  • Volume
    41
  • Issue
    8
  • fYear
    2015
  • fDate
    Aug. 1 2015
  • Firstpage
    764
  • Lastpage
    780
  • Abstract
    Developing modern distributed software systems is difficult in part because they have little control over the environments in which they execute. For example, hardware and software resources on which these systems rely may fail or become compromised and malicious. Redundancy can help manage such failures and compromises, but when faced with dynamic, unpredictable resources and attackers, the system reliability can still fluctuate greatly. Empowering the system with self-adaptive and self-managing reliability facilities can significantly improve the quality of the software system and reduce reliance on the developer predicting all possible failure conditions. We present iterative redundancy, a novel approach to improving software system reliability by automatically injecting redundancy into the system's deployment. Iterative redundancy self-adapts in three ways: (1) by automatically detecting when the resource reliability drops, (2) by identifying unlucky parts of the computation that happen to deploy on disproportionately many compromised resources, and (3) by not relying on a priori estimates of resource reliability. Further, iterative redundancy is theoretically optimal in its resource use: Given a set of resources, iterative redundancy guarantees to use those resources to produce the most reliable version of that software system possible; likewise, given a desired increase in the system's reliability, iterative redundancy guarantees achieving that reliability using the least resources possible. Iterative redundancy handles even the Byzantine threat model, in which compromised resources collude to attack the system. We evaluate iterative redundancy in three ways. First, we formally prove its self-adaptation, efficiency, and optimality properties. Second, we simulate it at scale using discrete event simulation. Finally, we modify the existing, open-source, volunteer-computing BOINC software system and deploy it on the globally-distributed PlanetLab testbed network to empirically evaluate that iterative redundancy is self-adaptive and more efficient than existing techniques.
  • Keywords
    discrete event simulation; distributed processing; public domain software; resource allocation; security of data; software quality; software reliability; system recovery; Byzantine threat model; compromise management; compromised resource collusion; discrete event simulation; distributed software systems; dynamic unpredictable resources; failure condition; failure management; globally-distributed PlanetLab testbed network; hardware resources; iterative redundancy; open-source volunteer-computing BOINC software system; optimality property; resource reliability estimate; self-adapting reliability; self-adaptive reliability; self-managing reliability; software resources; software system quality; system reliability; Computational modeling; Redundancy; Reliability engineering; Servers; Software reliability; Software systems; Redundancy; fault-tolerance; iterative redundancy; optimal redundancy; reliability; self-adaptation;
  • fLanguage
    English
  • Journal_Title
    Software Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0098-5589
  • Type

    jour

  • DOI
    10.1109/TSE.2015.2412134
  • Filename
    7058381