• DocumentCode
    2206095
  • Title

    Reliability analysis of a hardware and software fault tolerant parallel processor

  • Author

    Dugan, Joanne Bechta

  • Author_Institution
    Dept. of Electr. Eng., Virginia Univ., Charlottesville, VA, USA
  • fYear
    1994
  • fDate
    25-27 Oct 1994
  • Firstpage
    74
  • Lastpage
    83
  • Abstract
    Computer systems for critical applications must be designed to tolerate software faults as well as hardware faults. A unified approach to tolerating hardware and software faults is characterized by classifying faults in terms of duration (transient or permanent) rather than source (hardware or software). Errors arising from transient faults can be handled through masking or voting, but errors arising from permanent faults require system reconfiguration to bypass the failed component. Most errors caused by software faults can be considered transient, in that they are input dependent. Quantitative dependability analysis of systems which exhibit a unified approach to fault tolerance can be performed by a hierarchical combination of fault tree and Markov models. In this paper, a methodology for analyzing hardware and software fault tolerant systems is applied to the analysis of a hypothetical system, loosely based on the fault tolerant parallel processor (FTPP). The model considers transient and permanent faults, hardware and software faults, unrelated and related software faults, and automatic recovery and reconfiguration. The parameter values for the software part of the model are determined from an experimental implementation of an N-version programming application. The parameter values chosen for the hardware part of the model are considered fairly typical.
  • Keywords
    Markov processes; fault tolerant computing; parallel processing; performance evaluation; Markov models; N-version programming; fault tolerant parallel processor; quantitative dependability analysis; reliability analysis; system reconfiguration; transient faults; Application software; Computer errors; Fault tolerance; Fault tolerant systems; Fault trees; Hardware; Performance analysis; Software design; Software systems; Voting
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Reliable Distributed Systems, 1994. Proceedings., 13th Symposium on
  • Conference_Location
    Dana Point, CA
  • Print_ISBN
    0-8186-6575-0
  • Type

    conf

  • DOI
    10.1109/RELDIS.1994.336907
  • Filename
    336907