DocumentCode
3172847
Title
How fail-stop are faulty programs?
Author
Chandra, S. ; Chen, P.M.
Author_Institution
Dept. of Electrical Engineering and Computer Science, University of Michigan, MI, USA
fYear
1998
fDate
23-25 June 1998
Firstpage
240
Lastpage
249
Abstract
Most fault-tolerant systems are designed to stop faulty programs before they write permanent data or communicate with other processes. This property (halt-on-failure) forms the core of the fail-stop model. Unfortunately, little experimental data exists on whether or not program failures follow the fail-stop model. This paper describes a tool, based on the SimOS complete-machine simulator, that can trace how faults propagate through memory, disk, and functions. Using this tool on the Postgres database system, we conduct a controlled experiment to measure how often faulty programs violate the fail-stop model. We find that a significant number of faults (7%) violate the fail-stop model by writing incorrect data to stable storage before halting. We then apply Postgres' transaction mechanism to undo recent changes before a crash and find that transactions reduce fail-stop violations by a factor of 3.
Keywords
relational databases; software fault tolerance; system recovery; transaction processing; virtual machines; Postgres database; SimOS; complete-machine simulator; experiment; fail-stop model; fault-tolerant systems; faulty programs; halt-on-failure; Application software; Computer bugs; Computer science; Condition monitoring; Fault detection; Kernel; Software systems; System software; Transaction databases; Workstations
fLanguage
English
Publisher
IEEE
Conference_Titel
Fault-Tolerant Computing, 1998. Digest of Papers. Twenty-Eighth Annual International Symposium on
Conference_Location
Munich, Germany
ISSN
0731-3071
Print_ISBN
0-8186-8470-4
Type
conf
DOI
10.1109/FTCS.1998.689475
Filename
689475