• DocumentCode
    3637577
  • Title
    Handling Crash and Software Faults Efficiently in Distributed Event Stream Processing
  • Author
    Andrey Brito;Stefan Weigert;Martin Süßkraut;Christof Fetzer;Pascal Felber
  • Author_Institution
    Syst. Eng. Group, Tech. Univ. Dresden, Dresden, Germany
  • fYear
    2010
  • Firstpage
    164
  • Lastpage
    172
  • Abstract
    Active replication is a common approach to handle failures in distributed systems, including Event Stream Processing (ESP) systems. However, one weakness of conventional active replication is that replicas, being equal and in the same state, are susceptible to common-mode crashes due to software bugs. We propose a new approach to active replication that assumes a failure model stronger than fail-stop but weaker than models permitting arbitrary failures. We combine transactional memory and extended runtime checking to achieve: (i) low processing latency in failure-free runs by allowing downstream nodes to use speculative results and, thus, to circumvent the overhead added by the extended runtime checks; (ii) a reduced MTTR by enabling localized rollbacks (with word granularity) in several cases. We show that major limitations of n-variant active replication (e.g., multi-threading support, complex and slow recovery) can be overcome and that tolerance to software bugs is orthogonal to Byzantine fault tolerance.
  • Keywords
    "Software","Computer bugs","Runtime","Protocols","Fault tolerance","Fault tolerant systems"
  • Publisher
    IEEE
  • Conference_Titel
    Dependability (DEPEND), 2010 Third International Conference on
  • Print_ISBN
    978-1-4244-7530-8
  • Type
    conf
  • DOI
    10.1109/DEPEND.2010.32
  • Filename
    5562833