DocumentCode
2062067
Title
Functional Tests of the RADIC Fault Tolerance Architecture
Author
Duarte, Angelo ; Rexachs, Dolores ; Luque, Emilio
Author_Institution
Comput. Archit. & Oper. Syst. Dept., Univ. Autonoma de Barcelona
fYear
2007
fDate
7-9 Feb. 2007
Firstpage
278
Lastpage
287
Abstract
Clusters with thousands of nodes are a reality, and the current trend indicates that they are becoming larger. Such large clusters are subject to a relatively high fault frequency, so a fault-tolerance scheme is mandatory to ensure correct application completion. Message passing is the programming model most often used in large clusters, yet current implementations of fault tolerance in message-passing systems do not focus on an architecture that simultaneously addresses scalability, transparency, and independence from stable or central elements. The RADIC architecture was proposed and designed as a fully distributed structure to meet these requirements. This architecture defines a fully distributed fault-tolerance controller implemented by a set of system processes that collaborate to perform all the basic functions of a fault-tolerance protocol. This paper presents the test methodology used to verify the functionality of the RADIC architecture using RADICMPI, a prototype based on the MPI semantics.
Keywords
fault tolerant computing; message passing; parallel architectures; RADIC fault tolerance architecture; RADICMPI; fully distributed structure; functional tests; message passing; programming model; Collaboration; Control systems; Distributed control; Fault tolerance; Fault tolerant systems; Frequency; Message passing; Protocols; Scalability; Testing;
fLanguage
English
Publisher
ieee
Conference_Titel
Parallel, Distributed and Network-Based Processing, 2007. PDP '07. 15th EUROMICRO International Conference on
Conference_Location
Napoli
ISSN
1066-6192
Print_ISBN
0-7695-2784-1
Type
conf
DOI
10.1109/PDP.2007.45
Filename
4135288