DocumentCode
927445
Title
The Effects of an ARMOR-based SIFT environment on the performance and dependability of user applications
Author
Whisnant, Keith ; Iyer, Ravishankar K. ; Kalbarczyk, Zbigniew T. ; Jones, Phillip H., III ; Rennels, David A. ; Some, Raphael
Author_Institution
Sun Microsystems Inc., San Diego, CA, USA
Volume
30
Issue
4
fYear
2004
fDate
4/1/2004 12:00:00 AM
Firstpage
257
Lastpage
277
Abstract
Few distributed software-implemented fault tolerance (SIFT) environments have been experimentally evaluated using substantial applications to show that they protect both themselves and the applications from errors. We present an experimental evaluation of a SIFT environment used to oversee spaceborne applications as part of the Remote Exploration and Experimentation (REE) program at the Jet Propulsion Laboratory. The SIFT environment is built around a set of self-checking ARMOR processes running on different machines that provide error detection and recovery services to themselves and to the REE applications. An evaluation methodology is presented in which over 28,000 errors were injected into both the SIFT processes and two representative REE applications. The experiments were split into three groups of error injections, with each group successively stressing the SIFT error detection and recovery more than the previous group. The results show that the SIFT environment added negligible overhead to the application's execution time during failure-free runs. Correlated failures affecting a SIFT process and application process are possible, but the division of detection and recovery responsibilities in the SIFT environment allows it to recover from these multiple failure scenarios. Only 28 cases were observed in which either the application failed to start or the SIFT environment failed to recognize that the application had completed. Further investigations showed that assertions within the SIFT processes, coupled with object-based incremental checkpointing, were effective in preventing system failures by protecting dynamic data within the SIFT processes.
Keywords
aerospace computing; distributed processing; software fault tolerance; system recovery; ARMOR-based SIFT environment; Remote Exploration and Experimentation program; distributed systems; error detection; error recovery; object-based incremental checkpointing; software-implemented fault tolerance; spaceborne applications; user applications; Application software; Availability; Computer crashes; Fault tolerance; Fault tolerant systems; Helium; Propulsion; Protection; Space missions; Telescopes;
fLanguage
English
Journal_Title
Software Engineering, IEEE Transactions on
Publisher
ieee
ISSN
0098-5589
Type
jour
DOI
10.1109/TSE.2004.1274045
Filename
1274045