DocumentCode :
2632777
Title :
On integrating error detection into a fault diagnosis algorithm for massively parallel computers
Author :
Altmann, Jörn ; Bartha, Tamas ; Pataricza, András
Author_Institution :
Dept. of Comput. Sci., Erlangen-Nurnberg Univ., Germany
fYear :
1995
fDate :
24-26 Apr 1995
Firstpage :
154
Lastpage :
164
Abstract :
Scalable fault diagnosis is necessary for constructing fault tolerance mechanisms in large massively parallel multiprocessor systems. The diagnosis algorithm must operate efficiently even if the system consists of several thousand processors. We introduce an event-driven, distributed system-level diagnosis algorithm. It uses a small number of messages and is based on a general diagnosis model that places no limit on the number of simultaneously existing faults (an important requirement for massively parallel computers). The algorithm integrates both error detection techniques, such as "I'm alive" messages, and built-in hardware mechanisms. The structure of the implemented algorithm is presented and the essential program modules are described. The paper also discusses the use of test results generated by error detection mechanisms for fault localization. Measurement results illustrate the effect of the diagnosis algorithm, in particular of the error detection mechanism based on "I'm alive" messages, on application performance.
Keywords :
computer testing; distributed algorithms; error detection; fault location; multiprocessing systems; parallel machines; I'm alive messages; application performance; built in hardware mechanisms; error detection integration; event-driven distributed system-level diagnosis algorithm; fault diagnosis algorithm; fault localization; fault tolerance mechanisms; general diagnosis model; large massively parallel multiprocessor systems; massively parallel computers; messages; program modules; scalable fault diagnosis; simultaneously existing faults; test results; Application software; Clustering algorithms; Computer errors; Concurrent computing; Fault detection; Fault diagnosis; Fault tolerant systems; Hardware; Instruments; Scalability
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer Performance and Dependability Symposium, 1995. Proceedings., International
Conference_Location :
Erlangen
Print_ISBN :
0-8186-7059-2
Type :
conf
DOI :
10.1109/IPDS.1995.395836
Filename :
395836