DocumentCode
1336230
Title
Hierarchical error detection in a software implemented fault tolerance (SIFT) environment
Author
Bagchi, Saurabh ; Srinivasan, Balaji ; Whisnant, Keith ; Kalbarczyk, Zbigniew ; Iyer, Ravishankar K.
Author_Institution
Coordinated Sci. Lab., Illinois Univ., Urbana, IL, USA
Volume
12
Issue
2
fYear
2000
Firstpage
203
Lastpage
224
Abstract
Proposes a hierarchical error detection framework for a software-implemented fault tolerance (SIFT) layer of a distributed system. A four-level error detection hierarchy is proposed in the context of Chameleon, a software environment for providing adaptive fault tolerance in an environment of commercial off-the-shelf (COTS) system components and software. The design and implementation of a software-based distributed signature monitoring scheme, which is central to the proposed four-level hierarchy, is described. Both intra-level and inter-level optimizations that minimize the overhead of detection and are capable of adapting to runtime requirements are proposed. The paper presents results from a prototype implementation of two levels of the error detection hierarchy and results of a detailed simulation of the overall environment. The results indicate a substantial increase in availability due to the detection framework and help in understanding the tradeoffs between overhead and coverage for different combinations of techniques.
Keywords
distributed processing; error detection; minimisation; software fault tolerance; Chameleon; SIFT environment; adaptive fault tolerance; availability; commercial off-the-shelf system components; coverage; distributed system; error detection overhead minimization; four-level error detection hierarchy; hierarchical error detection framework; inter-level optimization; intra-level optimization; prototype implementation; runtime requirements; simulation; software-based distributed signature monitoring scheme; software-implemented fault tolerance; speculative execution; Application software; Buildings; Fault detection; Fault tolerance; Fault tolerant systems; Hardware; Monitoring; Runtime; Software design; Software systems;
fLanguage
English
Journal_Title
Knowledge and Data Engineering, IEEE Transactions on
Publisher
ieee
ISSN
1041-4347
Type
jour
DOI
10.1109/69.842263
Filename
842263