Author_Institution :
Intel Corp., Hillsboro, OR, USA
Abstract :
Gracefully degrading systems represent a cost-effective alternative to massively redundant fault-tolerant computing systems. Assessing the effectiveness of these systems requires combined reliability and performance measures such as computational availability, performability, and accumulated reward. This paper compares, for the first time, two numerical algorithms used for assessing the complementary distribution of the accumulated reward and the expected accumulated reward, respectively. Both methods are employed for analyzing a multiprocessor server. The first one, based on Laplace transforms, numerical evaluation of eigenvalues, and analytical and numerical inversion of the Laplace transforms, gives accurate results for low values of the accumulated reward. However, instability of the numerical inversion routine negatively affects the results when the accumulated reward approaches the maximum attainable performance of the system. The second method, which relies on randomization, proves to be insensitive to the performance level reached by the system. This approach is used to analyze the impact of the fault/error coverage probability, spare processing units, repair, and performance degradation on the expected accumulated reward of the server. We conclude that the randomization-based method is a more accurate approach for assessing the reliability and performance of gracefully degrading systems.
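As an illustration of the randomization idea mentioned in the abstract, the following is a minimal sketch of the textbook randomization (uniformization) formula for the expected accumulated reward of a continuous-time Markov reward model. It is not the paper's own algorithm, which this record does not give. The function name, its parameters, and the truncation rule are assumptions made for the example. With generator matrix Q, reward rates r, initial distribution pi0, and uniformization rate Lambda >= max_i |Q_ii|, the sketch evaluates E[Y(t)] = (1/Lambda) * sum_k P(N > k) * (pi0 P^k r), where P = I + Q/Lambda and N is Poisson with mean Lambda*t.

```python
import math


def expected_accumulated_reward(Q, rewards, pi0, t, tol=1e-12, max_terms=100000):
    """Expected reward accumulated over [0, t] via uniformization.

    Hypothetical helper for illustration only.
    Q       -- generator matrix (list of rows, each row sums to zero)
    rewards -- reward rate earned per unit time in each state
    pi0     -- initial state probability vector
    """
    n = len(Q)
    lam = max(-Q[i][i] for i in range(n))  # uniformization rate
    start_rate = sum(pi0[i] * rewards[i] for i in range(n))
    if lam == 0.0 or t == 0.0:
        # No transitions ever occur (or no time elapses).
        return t * start_rate
    if lam * t > 700.0:
        # Assumed limitation of this sketch: exp(-lam*t) would underflow.
        raise ValueError("lam * t too large for this simple sketch")

    # Transition matrix of the uniformized discrete-time chain.
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / lam for j in range(n)]
         for i in range(n)]

    v = list(pi0)                 # v = pi0 * P^k, starting at k = 0
    pois = math.exp(-lam * t)     # Poisson pmf at k = 0
    tail = 1.0 - pois             # P(N > 0)
    total = 0.0
    k = 0
    while k < max_terms:
        total += tail * sum(v[i] * rewards[i] for i in range(n))
        if tail < tol:
            break                 # remaining Poisson tail is negligible
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        k += 1
        pois *= lam * t / k       # Poisson pmf at k
        tail = max(tail - pois, 0.0)  # P(N > k), clamped against round-off
    return total / lam


# Usage: a two-state up/down model with failure rate a and repair rate b.
a, b, t = 0.1, 1.0, 10.0
Q = [[-a, a], [b, -b]]
uptime = expected_accumulated_reward(Q, [1.0, 0.0], [1.0, 0.0], t)
```

Every term in the sum is non-negative whenever the reward rates are non-negative, so no cancellation occurs in the accumulation. This is the usual explanation for the numerical robustness of randomization compared with transform inversion.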
Keywords :
Laplace transforms; eigenvalues and eigenfunctions; fault tolerant computing; file servers; multiprocessing systems; numerical stability; redundancy; reliability; accumulated reward; complementary distribution; computational availability; eigenvalues; expected accumulated reward; fault/error coverage probability; gracefully degrading computer systems; massively redundant fault-tolerant computing systems; multiprocessor server; numerical algorithms; numerical inversion routine instability; numerical techniques; performability; performance assessment; performance degradation; randomization; reliability assessment; repair; spare processing units; Availability; Costs; Degradation; Distribution functions; Fault tolerant systems; Finite wordlength effects; Parallel machines; Performance analysis; Performance evaluation