DocumentCode
1886347
Title
Compiler assisted fault detection for distributed-memory systems
Author
Gong, Chun ; Melhem, Rani ; Gupta, Rajiv
Author_Institution
Dept. of Comput. Sci., Pittsburgh Univ., PA, USA
fYear
1994
fDate
23-25 May 1994
Firstpage
373
Lastpage
380
Abstract
Distributed-memory systems provide the most promising performance-to-cost ratio for multiprocessor computers due to their scalability. However, the issues of fault detection and fault tolerance are critical in such systems, since the probability of having faulty components increases with the number of processors. We propose a methodology for fault detection through compiler support. More specifically, we augment the single-program multiple-data (SPMD) execution model to duplicate selected data items in such a way that, during execution, whenever a value of a duplicated data item is computed, the owners of the data are tested. The proposed compiler-assisted fault detection technique does not require any specialized hardware and allows for a selective choice of redundancy at compile time.
Keywords
computer debugging; distributed memory systems; fault tolerant computing; program compilers; reliability; software reliability; compile time; compiler assisted fault detection; data item duplication; distributed-memory systems; fault tolerance; multiprocessor computers; performance to cost ratio; probability; redundancy; scalability; single-program multiple-data execution model; specialized hardware; Computer science; Costs; Distributed computing; Fault detection; Fault tolerance; Fault tolerant systems; Hardware; Multiprocessing systems; Redundancy; Testing;
fLanguage
English
Publisher
IEEE
Conference_Titel
Proceedings of the Scalable High-Performance Computing Conference, 1994
Conference_Location
Knoxville, TN
Print_ISBN
0-8186-5680-8
Type
conf
DOI
10.1109/SHPCC.1994.296667
Filename
296667