Title :
Supporting nondeterministic execution in fault-tolerant systems
Author :
Slye, J. Hamilton; Elnozahy, E.N.
Author_Institution :
Dept. of Electr. & Comput. Eng., Carnegie Mellon Univ., Pittsburgh, PA, USA
Abstract :
We present a technique to track nondeterminism resulting from asynchronous events and multithreading in log-based rollback-recovery protocols. This technique relies on using a software counter to compute the number of instructions between nondeterministic events in normal operation. Should a failure occur, the instruction counts are used to force the replay of these events at the same execution points. The execution of the application thus can be replayed to recreate the pre-failure state, while accommodating uncontrolled nondeterminism during normal operation. Implementation on a DEC Alpha processor shows that this support has a low overhead, typically less than 6% increase in running time for the applications we studied.
Keywords :
fault tolerant computing; memory protocols; shared memory systems; software fault tolerance; system recovery; DEC Alpha processor; asynchronous events; fault-tolerant systems; log-based rollback-recovery protocols; multithreading; nondeterministic execution; pre-failure state; software counter; uncontrolled nondeterminism; Application software; Checkpointing; Counting circuits; Fault tolerant systems; Multithreading; Production; Protocols; Registers; Resumes; Yarn;
Conference_Title :
Fault Tolerant Computing, 1996., Proceedings of Annual Symposium on
Conference_Location :
Sendai
Print_ISBN :
0-8186-7262-5
DOI :
10.1109/FTCS.1996.534611
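Illustrative_Sketch :
The count-and-replay idea in the abstract can be sketched as a toy simulation. This is not the authors' implementation: the paper counts machine instructions on a DEC Alpha, whereas the sketch below assumes each loop iteration stands for one instruction. The names `step`, `handle_event`, `run_and_log`, and `replay` are hypothetical, chosen only for illustration. In normal operation, the counter records how many instructions ran since the previous asynchronous event; during recovery, each logged event is delivered again once the counter reaches the same value.

```python
import random

INITIAL_STATE = 1

def step(state, i):
    # One deterministic "instruction" (assumption: a loop iteration
    # stands in for a machine instruction).
    return (state * 31 + i) % 1_000_003

def handle_event(state, payload):
    # Effect of an asynchronous event (e.g. a signal handler) on the state.
    return state ^ payload

def run_and_log(n_instructions, seed):
    """Normal operation: events arrive at unpredictable points.
    Log (instructions since previous event, payload) for each one."""
    rng = random.Random(seed)  # stands in for uncontrolled timing
    state, count, log = INITIAL_STATE, 0, []
    for i in range(n_instructions):
        if rng.random() < 0.01:
            payload = rng.randrange(1 << 16)
            log.append((count, payload))
            state = handle_event(state, payload)
            count = 0  # restart the count after each event
        state = step(state, i)
        count += 1
    return state, log

def replay(n_instructions, log):
    """Recovery: deliver each logged event when the counter reaches
    the logged instruction count, with no randomness involved."""
    state, count = INITIAL_STATE, 0
    pending = iter(log)
    nxt = next(pending, None)
    for i in range(n_instructions):
        if nxt is not None and count == nxt[0]:
            state = handle_event(state, nxt[1])
            count = 0
            nxt = next(pending, None)
        state = step(state, i)
        count += 1
    return state

final_state, event_log = run_and_log(10_000, seed=42)
recovered_state = replay(10_000, event_log)
```

Under these assumptions, `recovered_state` equals `final_state` because the replay delivers every event at the same execution point as the original run. A real system would also have to log event contents to stable storage and handle thread scheduling, which the sketch omits.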