• DocumentCode
    2611286
  • Title

    Recovery schemes for mesh arrays utilizing dedicated spares

  • Author

    Goldberg, S.R. ; Upadhyaya, S.J. ; Fuchs, W.K.

  • Author_Institution
    Dept. of Electr. & Comput. Eng., State Univ. of New York, Buffalo, NY, USA
  • fYear
    1996
  • fDate
    6-8 Nov 1996
  • Firstpage
    318
  • Lastpage
    326
  • Abstract
    Error recovery capability is examined in processing arrays that employ spare nodes for fault tolerance. Spares can provide fault tolerance to high-performance single-package arrays, where it is not feasible to repair faulty subsystems. The cost of such a fault-tolerance solution, redundant hardware that idles until needed, may not be practical. Manufacturers must be offered hardware solutions to fault tolerance that provide useful work at all times. In this paper, new schemes are presented in which idling spares can be utilized to improve error recovery. Without expedient error recovery, computation in environments experiencing frequent errors can be burdened with extra cost in terms of job completion time. Further, in such environments, a job may never be able to reach completion. Spares will aid in the validation and in the selection of recovery points in systems experiencing randomly distributed errors. Successful job completion in environments of error bursts is performed with the aid of a scheme that identifies reliable data when periodic on-line testing is available. Spares will help identify the boundaries of reliable data. We consider these features in mesh arrays that are used in digital signal processing applications. Preliminary simulations highlight the overhead of our schemes in terms of job completion times in environments burdened with transient errors.
  • Keywords
    VLSI; digital signal processing chips; fault diagnosis; fault tolerant computing; multiprocessing systems; parallel architectures; system recovery; dedicated spares; digital signal processing applications; error bursts; error recovery capability; fault tolerance; idling spares; job completion time; mesh arrays; processing arrays; randomly distributed errors; recovery points; single-package arrays; transient errors; Circuit faults; Computer errors; Costs; Digital signal processing; Fault diagnosis; Fault tolerance; Hardware; Manufacturing; Performance evaluation; Testing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Defect and Fault Tolerance in VLSI Systems, 1996. Proceedings., 1996 IEEE International Symposium on
  • Conference_Location
    Boston, MA
  • ISSN
    1550-5774
  • Print_ISBN
    0-8186-7545-4
  • Type

    conf

  • DOI
    10.1109/DFTVS.1996.572039
  • Filename
    572039