Author :
Cleveland, Jeffrey ; Loyall, Joseph ; Hanna, J.
Author_Institution :
Raytheon BBN Technol., Cambridge, MA, USA
Abstract :
Fault tolerance and survivability are important aspects of many business-critical and mission-critical systems, but it is still difficult to assess how well fault tolerance techniques work. Ensuring fault tolerance in military communication systems is particularly important due to the inevitability of hardware failure, data corruption, or service interruption and the risk that cascading failures could jeopardize critical military operations. In this paper, we present a fault tolerance assessment framework for distributed systems that automatically injects faults without changes to client or server code and automatically assesses whether the injected faults are tolerated. The framework applies aspect-oriented programming, specifically AspectJ, to inject faults and weave in assessment criteria. Like traditional fault injectors, the framework supports assessing the tolerance of direct faults, such as crashes and corruption; it also supports conditional faults, which can be injected probabilistically, randomly, or periodically at runtime. This latter class of faults has not historically been supported by fault injectors, but it enables the assessment of tolerance to many important classes of faults threatening modern distributed military communication systems, including timing faults, resource exhaustion (e.g., denial of service), and integrity faults that are traditionally difficult to tolerate and assess. Additionally, the framework provides a centralized view that lets users monitor and script coordinated tests comprising performance metrics and injected faults spanning services, applications, and hosts.
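As a rough illustration of the conditional faults described in the abstract, the following Java sketch injects a crash-like fault periodically or probabilistically around calls to a service, without modifying the service or its caller, and then counts how many calls were tolerated. It is not the framework from the paper: it substitutes a JDK dynamic proxy for AspectJ weaving so that it runs with a plain Java compiler, and every name in it (ConditionalFaultSketch, EchoService, FaultCondition, withFaults, countTolerated) is a hypothetical chosen for this example.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.Random;

public class ConditionalFaultSketch {

    /** Hypothetical service interface standing in for a real distributed service. */
    public interface EchoService {
        String echo(String message);
    }

    /** Decides, for each intercepted call, whether a fault should be injected. */
    public interface FaultCondition {
        boolean shouldInject();
    }

    /** Periodic condition: fires on every n-th intercepted call. */
    public static FaultCondition periodic(int n) {
        int[] calls = {0};
        return () -> ++calls[0] % n == 0;
    }

    /** Probabilistic condition: fires with probability p (seeded for repeatability). */
    public static FaultCondition probabilistic(double p, long seed) {
        Random rng = new Random(seed);
        return () -> rng.nextDouble() < p;
    }

    /**
     * Wraps the target so that a crash-like fault (an unchecked exception) is
     * thrown whenever the condition fires. Assumption: a dynamic proxy stands in
     * here for the around-advice an aspect weaver would apply.
     */
    @SuppressWarnings("unchecked")
    public static <T> T withFaults(Class<T> iface, T target, FaultCondition condition) {
        InvocationHandler handler = (proxy, method, args) -> {
            if (condition.shouldInject()) {
                throw new IllegalStateException("injected fault in " + method.getName());
            }
            try {
                return method.invoke(target, args);
            } catch (InvocationTargetException e) {
                throw e.getCause(); // surface the service's own exception unchanged
            }
        };
        return (T) Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[] {iface}, handler);
    }

    /** Simple assessment criterion: how many calls returned the expected reply. */
    public static int countTolerated(EchoService service, int calls) {
        int ok = 0;
        for (int i = 0; i < calls; i++) {
            String message = "m" + i;
            try {
                if (message.equals(service.echo(message))) {
                    ok++;
                }
            } catch (RuntimeException e) {
                // The fault reached the caller, so this call was not tolerated.
            }
        }
        return ok;
    }
}
```

Wrapping an unprotected service with periodic(3) and making nine calls lets six of them through, since nothing in this sketch masks the fault. Placing a retry or failover layer between the caller and the wrapped service would be one way to see the tolerated count rise.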
Keywords :
aspect-oriented programming; military communication; military computing; software fault tolerance; software metrics; AspectJ; automated fault injection; business-critical systems; conditional fault tolerance assessment; crashes; direct fault tolerance assessment; distributed military communication systems; fault tolerance assessment framework; injected fault spanning services; mission-critical systems; performance metrics; Fault tolerance; Fault tolerant systems; Java; Measurement; Programming; Prototypes; Weaving; assessment; survivability; testing