Title :
Transparent Fault Tolerance Solution at Socket Level Based on RADIC
Author :
Castro, Marcela ; Rexachs, Dolores ; Luque, Emilio
Author_Institution :
Comput. Archit. & Oper. Syst. Dept., Univ. Autonoma de Barcelona, Barcelona, Spain
Abstract :
We present a transparent middleware for fault tolerance based on RADIC (Redundant Array of Distributed Independent Controllers), a transparent and scalable fault-tolerant architecture for parallel applications. The middleware operates at socket level and establishes a secure tunnel connection that keeps the TCP sessions established by the application alive in spite of node failures. It runs at user level and is independent of the message-passing communication library being used. Protection is provided through uncoordinated checkpoints and message logging, and recovery is performed automatically, so node failures require no intervention from the administrator. We have tested our fault tolerance system by executing master-worker (M/W) and SPMD applications that follow different communication patterns.
Keywords :
distributed processing; fault tolerant computing; message passing; middleware; M/W; RADIC; SPMD applications; master worker; message passing communication library; parallel applications; redundant array of distributed independent controllers; scalable fault tolerant architecture; socket level; transparent fault tolerance solution; transparent fault tolerant architecture; transparent middleware; Computer architecture; Fault tolerance; Fault tolerant systems; Libraries; Observers; Sockets; Fault-tolerance; High-Availability; MPI; RADIC; parallel computing;
Conference_Title :
Parallel and Distributed Processing with Applications (ISPA), 2012 IEEE 10th International Symposium on
Conference_Location :
Leganes
Print_ISBN :
978-1-4673-1631-6
DOI :
10.1109/ISPA.2012.121