• DocumentCode
    1081641
  • Title

    Using multi-stage and stratified sampling for inferring fault-coverage probabilities

  • Author

    Constantinescu, Cristian

  • Author_Institution
    Duke Univ., Durham, NC, USA
  • Volume
    44
  • Issue
    4
  • fYear
    1995
  • fDate
    12/1/1995
  • Firstpage
    632
  • Lastpage
    639
  • Abstract
    Development of fault-tolerant computing systems requires accurate reliability modeling. Analytic, simulation, and hybrid models are commonly used for obtaining reliability measures. These measures are functions of component failure rates and fault-coverage (probabilities). Coverage provides information about the fault and error detection, isolation, and system recovery capabilities. This parameter can be derived by physical or simulated fault injection. Statistical inference has been used to extract meaningful information from sample observations. The problem of conducting fault injection experiments and statistically inferring the coverage from the information gathered in those experiments is addressed in this paper. We perform statistical experiments in a multi-dimensional space of events. In this way all major factors which influence the coverage (fault locations, timing characteristics of the fault, and the workload) are accounted for. Multi-stage, stratified, and combined multi-stage and stratified sampling are used in this paper for deriving the coverage. Equations of the mean, variance, and confidence interval of the coverage are provided. The statistical error produced by the injected faults which do not induce errors in the tested system (also known as the nonresponse problem) is considered. A program which emulates a typical fault environment was developed and four hypothetical systems are analyzed.
  • Keywords
    fault tolerant computing; probability; reliability; reliability theory; system recovery; analytic models; component failure rates; confidence interval; error detection; fault detection; fault injection simulation; fault-coverage; fault-coverage probabilities; hybrid models; multi-stage sampling; nonresponse problem; reliability modeling; simulation models; statistical inference; stratified sampling; system recovery capabilities; timing characteristics; Analytical models; Computational modeling; Data mining; Fault detection; Fault location; Fault tolerant systems; Probability; Sampling methods; System recovery; Timing;
  • fLanguage
    English
  • Journal_Title
    Reliability, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0018-9529
  • Type

    jour

  • DOI
    10.1109/24.475993
  • Filename
    475993