Regional Information Center for Science and Technology (RICeST) - Fault tolerant memory design for HW/SW co-reliability in massively parallel computing systems

DocumentCode :

3474744

Title :

Fault tolerant memory design for HW/SW co-reliability in massively parallel computing systems

Author :

Choi, M. ; Park, N.J. ; George, K.M. ; Jin, B. ; Park, N. ; Kim, Y.B. ; Lombardi, F.

Author_Institution :

Dept. of Electr. & Comput. Eng., Missouri Univ., Rolla, MO, USA

fYear :

2003

fDate :

16-18 April 2003

Firstpage :

341

Lastpage :

348

Abstract :

A highly dependable embedded fault-tolerant memory architecture for high performance massively parallel computing applications and its dependability assurance techniques are proposed and discussed in this paper. The proposed fault tolerant memory provides two distinctive repair mechanisms: the permanent laser redundancy reconfiguration during the wafer probe stage in the factory to enhance its manufacturing yield, and the dynamic BIST/BISD/BISR (built-in-self-test-diagnosis-repair)-based reconfiguration of the redundant resources in the field to maintain high field reliability. The system reliability, which is mainly determined by the hardware configuration demanded by software and by field reconfiguration/repair utilizing unused processor and memory modules, is referred to as HW/SW Co-reliability. Various system configuration options in terms of parallel processing unit size and processor/memory intensity are also introduced and their HW/SW Co-reliability characteristics are discussed. A modeling and assurance technique for HW/SW Co-reliability, with emphasis on the dependability assurance techniques based on combinatorial modeling suitable for the proposed memory design, is developed and validated by extensive parametric simulations. Thereby, design and implementation of memory-reliability-optimized and highly reliable fault-tolerant field reconfigurable massively parallel computing systems can be achieved.

Keywords :

SRAM chips; built-in self test; fault tolerant computing; parallel architectures; parallel memories; redundancy; built-in-self-test-diagnosis-repair-based reconfiguration; dependability assurance techniques; high performance massively parallel computing applications; highly dependable embedded fault-tolerant memory architecture; permanent laser redundancy reconfiguration; repair mechanisms; system reliability; Built-in self-test; Fault tolerance; Fault tolerant systems; Maintenance; Manufacturing; Memory architecture; Parallel processing; Probes; Production facilities; Redundancy;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Network Computing and Applications, 2003. NCA 2003. Second IEEE International Symposium on

Print_ISBN :

0-7695-1938-5

Type :

conf

DOI :

10.1109/NCA.2003.1201173

Filename :

1201173

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3474744