• DocumentCode
    3172847
  • Title

    How fail-stop are faulty programs?

  • Author

    Chandra, S. ; Chen, P.M.

  • Author_Institution
    Dept. of Electr. Eng. & Comput. Sci., Michigan Univ., MI, USA
  • fYear
    1998
  • fDate
    23-25 June 1998
  • Firstpage
    240
  • Lastpage
    249
  • Abstract
    Most fault-tolerant systems are designed to stop faulty programs before they write permanent data or communicate with other processes. This property (halt-on-failure) forms the core of the fail-stop model. Unfortunately, little experimental data exists on whether or not program failures follow the fail-stop model. This paper describes a tool, based on the SimOS complete-machine simulator, that can trace how faults propagate through memory, disk, and functions. Using this tool on the Postgres database system, we conduct a controlled experiment to measure how often faulty programs violate the fail-stop model. We find that a significant number of faults (7%) violate the fail-stop model by writing incorrect data to stable storage before halting. We then apply Postgres' transaction mechanism to undo recent changes before a crash and find that transactions reduce fail-stop violations by a factor of 3.
  • Keywords
    relational databases; software fault tolerance; system recovery; transaction processing; virtual machines; Postgres database; SimOS; complete-machine simulator; experiment; fail-stop model; fault-tolerant systems; faulty programs; halt-on-failure; Application software; Computer bugs; Computer science; Condition monitoring; Fault detection; Kernel; Software systems; System software; Transaction databases; Workstations
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Fault-Tolerant Computing, 1998. Digest of Papers. Twenty-Eighth Annual International Symposium on
  • Conference_Location
    Munich, Germany
  • ISSN
    0731-3071
  • Print_ISBN
    0-8186-8470-4
  • Type

    conf

  • DOI
    10.1109/FTCS.1998.689475
  • Filename
    689475