Title :
Recovery schemes for mesh arrays utilizing dedicated spares
Author :
Goldberg, S.R. ; Upadhyaya, S.J. ; Fuchs, W.K.
Author_Institution :
Dept. of Electr. & Comput. Eng., State Univ. of New York, Buffalo, NY, USA
Abstract :
Error recovery capability is examined in processing arrays that employ spare nodes for fault tolerance. Spares can provide fault tolerance to high-performance single-package arrays, where it is not feasible to repair faulty subsystems. The cost of such a fault-tolerance solution, redundant hardware that idles until needed, may not be practical. Manufacturers must be offered hardware solutions to fault tolerance that provide useful work at all times. In this paper, new schemes are presented in which idling spares can be utilized to improve error recovery. Without expedient error recovery, computation in environments experiencing frequent errors can be burdened with extra cost in terms of job completion time. Further, in such environments, a job may never be able to reach completion. Spares will aid in the validation and in the selection of recovery points in systems experiencing randomly distributed errors. Successful job completion in environments of error bursts is achieved with the aid of a scheme that identifies reliable data when periodic on-line testing is available. Spares will help identify the boundaries of reliable data. We consider these features in mesh arrays that are used in digital signal processing applications. Preliminary simulations highlight the overhead of our schemes in terms of job completion times in environments burdened with transient errors.
Keywords :
VLSI; digital signal processing chips; fault diagnosis; fault tolerant computing; multiprocessing systems; parallel architectures; system recovery; dedicated spares; digital signal processing applications; error bursts; error recovery capability; fault tolerance; idling spares; job completion time; mesh arrays; processing arrays; randomly distributed errors; recovery points; single-package arrays; transient errors; Circuit faults; Computer errors; Costs; Digital signal processing; Fault diagnosis; Fault tolerance; Hardware; Manufacturing; Performance evaluation; Testing;
Conference_Title :
Defect and Fault Tolerance in VLSI Systems, 1996. Proceedings., 1996 IEEE International Symposium on
Conference_Location :
Boston, MA
Print_ISBN :
0-8186-7545-4
DOI :
10.1109/DFTVS.1996.572039