Title :
On low-cost error containment and recovery methods for guarded software upgrading
Author :
Tai, Ann T. ; Tso, Kam S. ; Alkalai, Leon ; Chau, Savio N. ; Sanders, William H.
Author_Institution :
IA Tech. Inc., Los Angeles, CA, USA
Abstract :
To assure dependable onboard evolution, we have developed a methodology called guarded software upgrading (GSU). We focus on a low-cost approach to error containment and recovery for GSU. To ensure low development cost, we exploit inherent system resource redundancies as the means of fault tolerance. To mitigate the effect of residual software faults at low performance cost, we take a crucial step in devising error containment and recovery methods by introducing the confidence-driven notion. This notion complements the message-driven (or communication-induced) approach employed by a number of existing checkpointing protocols for tolerating hardware faults. In particular, we discriminate between the individual software components with respect to our confidence in their reliability, and we keep track of changes in our confidence in particular processes (due to knowledge about potential process state contamination). This, in turn, enables the individual processes in the spaceborne distributed system to decide locally at run time whether to establish a checkpoint upon message passing and whether to roll back or roll forward during error recovery. The resulting message-driven, confidence-driven approach enables cost-effective checkpointing and cascading-rollback-free recovery.
Keywords :
aerospace computing; message passing; software fault tolerance; software maintenance; software performance evaluation; system recovery; cascading-rollback free recovery; checkpointing; confidence-driven notion; error containment; error recovery; fault tolerance; guarded software upgrading; low-cost error containment; message-driven approach; performance; protocols; residual software faults; run-time; software components; software reliability; spaceborne computing systems; spaceborne distributed system; system recovery methods; system resource redundancies; Contamination; Costs; Fault tolerant systems; Hardware; Redundancy; Runtime; Software performance
Conference_Title :
Distributed Computing Systems, 2000. Proceedings. 20th International Conference on
Conference_Location :
Taipei
Print_ISBN :
0-7695-0601-1
DOI :
10.1109/ICDCS.2000.840969