• DocumentCode
    927445
  • Title

    The Effects of an ARMOR-based SIFT environment on the performance and dependability of user applications

  • Author

    Whisnant, Keith ; Iyer, Ravishankar K. ; Kalbarczyk, Zbigniew T. ; Jones, Phillip H., III ; Rennels, David A. ; Some, Raphael

  • Author_Institution
    Sun Microsystems Inc., San Diego, CA, USA
  • Volume
    30
  • Issue
    4
  • fYear
    2004
  • fDate
    4/1/2004 12:00:00 AM
  • Firstpage
    257
  • Lastpage
    277
  • Abstract
    Few distributed software-implemented fault tolerance (SIFT) environments have been experimentally evaluated using substantial applications to show that they protect both themselves and the applications from errors. We present an experimental evaluation of a SIFT environment used to oversee spaceborne applications as part of the Remote Exploration and Experimentation (REE) program at the Jet Propulsion Laboratory. The SIFT environment is built around a set of self-checking ARMOR processes running on different machines that provide error detection and recovery services to themselves and to the REE applications. An evaluation methodology is presented in which over 28,000 errors were injected into both the SIFT processes and two representative REE applications. The experiments were split into three groups of error injections, with each group successively stressing the SIFT error detection and recovery more than the previous group. The results show that the SIFT environment added negligible overhead to the application's execution time during failure-free runs. Correlated failures affecting a SIFT process and application process are possible, but the division of detection and recovery responsibilities in the SIFT environment allows it to recover from these multiple failure scenarios. Only 28 cases were observed in which either the application failed to start or the SIFT environment failed to recognize that the application had completed. Further investigations showed that assertions within the SIFT processes, coupled with object-based incremental checkpointing, were effective in preventing system failures by protecting dynamic data within the SIFT processes.
  • Keywords
    aerospace computing; distributed processing; software fault tolerance; system recovery; ARMOR-based SIFT environment; Remote Exploration and Experimentation program; distributed systems; error detection; error recovery; object-based incremental checkpointing; software-implemented fault tolerance; spaceborne applications; user applications; Application software; Availability; Computer crashes; Fault tolerance; Fault tolerant systems; Helium; Propulsion; Protection; Space missions; Telescopes;
  • fLanguage
    English
  • Journal_Title
    Software Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0098-5589
  • Type

    jour

  • DOI
    10.1109/TSE.2004.1274045
  • Filename
    1274045