DocumentCode :
3361095
Title :
Fault Tolerance in Cluster Computing System
Author :
Patil, Ashwini ; Shah, Ankit ; Gaikwad, Sheetal ; Mishra, Akassh A. ; Kohli, Simranjit Singh ; Dhage, Sudhir
Author_Institution :
Dept. of Comput. Eng., Univ. of Mumbai, Mumbai, India
fYear :
2011
fDate :
26-28 Oct. 2011
Firstpage :
408
Lastpage :
412
Abstract :
With advances in technology, the demand for high performance computing is increasing tremendously. Cluster computing has developed due to the availability of cost-effective high performance processors and high speed networks. The long-term trend in high performance computing is toward an increasing number of nodes in parallel computing platforms, which entails a higher failure probability. The Message Passing Interface (MPI) is currently the programming paradigm and communication library most commonly used on parallel computing platforms. MPI applications may be stopped at any time by unpredictable failures during execution. In this paper we propose an efficient fault tolerant approach for MPI systems in an asymmetric cluster computing environment. The approach uses a centralized logging process, with message logging to handle message losses. The process has three main parts: failure detection, failure recovery and overload detection. Our system maintains monitor nodes for all nodes in the cluster; the difference is that every monitor node can also work as a cluster node while the system is functioning properly, and not just at the time of a node failure.
Keywords :
application program interfaces; fault tolerant computing; message passing; parallel processing; system recovery; workstation clusters; MPI applications; asymmetric cluster computing environment; centralized logging process; cluster computing system; communication library; failure probability; failure recovery; fault tolerant approach; high performance computing; message logging; message passing paradigm; parallel computing platforms; programming paradigm; Fault tolerance; Fault tolerant systems; Heart beat; Load management; Message passing; Monitoring; Servers; asymmetric cluster; failure recovery; fault tolerance
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), 2011 International Conference on
Conference_Location :
Barcelona
Print_ISBN :
978-1-4577-1448-1
Type :
conf
DOI :
10.1109/3PGCIC.2011.77
Filename :
6154915