• DocumentCode
    995608
  • Title

    A global-state-triggered fault injector for distributed system evaluation

  • Author

    Chandra, Ramesh ; Lefever, Ryan M. ; Joshi, Kaustubh R. ; Cukier, Michel ; Sanders, William H.

  • Author_Institution
    Dept. of Comput. Sci., Stanford Univ., CA, USA
  • Volume
    15
  • Issue
    7
  • fYear
    2004
  • fDate
    7/1/2004 12:00:00 AM
  • Firstpage
    593
  • Lastpage
    605
  • Abstract
    Validation of the dependability of distributed systems via fault injection is gaining importance because distributed systems are being increasingly used in environments with high dependability requirements. The fact that distributed systems can fail in subtle ways that depend on the state of multiple parts of the system suggests that a global-state-based fault injection mechanism should be used to validate them. However, global-state-based fault injection is challenging since it is very difficult in practice to maintain the global state of a distributed system at runtime with minimal intrusion into the system execution. We present Loki, a global-state-based fault injector, which has been designed with the goals of low intrusion, high precision, and high flexibility. Loki achieves these goals by utilizing the ideas of partial view of global state, optimistic synchronization, and offline analysis. In Loki, faults are injected based on a partial view of the global state of the system, and a post-runtime analysis is performed to place events and injections into a single global timeline and to discard experiments with incorrect fault injections. Finally, the experiments with correct fault injections are used to estimate user-specified performance and dependability measures. A flexible measure language has been designed that facilitates the specification of a wide range of measures.
  • Keywords
    distributed processing; fault tolerant computing; performance evaluation; synchronisation; system recovery; Loki; distributed system evaluation; global-state-based fault injection mechanism; offline clock synchronization; post-runtime analysis; user-specified performance; Air traffic control; Availability; Clocks; Helium; Monitoring; Performance analysis; State estimation; Synchronization; System testing; Web server; Distributed systems; fault injection; measure estimation; partial view of global state; reliable systems; system evaluation
  • fLanguage
    English
  • Journal_Title
    Parallel and Distributed Systems, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1045-9219
  • Type

    jour

  • DOI
    10.1109/TPDS.2004.14
  • Filename
    1302100