Abstract:
An overview and classification of existing mechanisms that support dependability and fault tolerance in distributed environments is presented. Important application-related mechanisms are described, including atomic actions and transactions, replicated remote procedure calls, exception handling, data replication facilities, distributed recovery blocks, and distributed configuration management support. System-related mechanisms are also discussed, including reliable basic communication facilities, adaptive routing, and redundant network interconnections. Special consideration is given to real-time issues, which lead to more advanced requirements. Finally, as an integration effort, a generic architecture that provides higher-level computational and structural fault tolerance support is presented. It comprises a basic routine layer, a fault detection and diagnosis layer, and a supporting tool area. Its major goal is to integrate and extend existing mechanisms for distributed fault tolerance support.
Keywords:
computer architecture; distributed processing; fault tolerant computing; adaptive routing; application-related mechanisms; atomic actions; data replication facilities; dependability; distributed applications; distributed configuration management support; distributed environments; distributed recovery blocks; exception handling; fault tolerance; generic architecture; redundant network interconnections; replicated remote procedure call; structural fault tolerance support; transactions; application software; fault detection; fault diagnosis; fault tolerant systems; routing; runtime; telecommunication network reliability; telematics