Regional Information Center for Science and Technology - Fault Tolerance in Cluster Computing System

DocumentCode :

3361095

Title :

Fault Tolerance in Cluster Computing System

Author :

Patil, Ashwini ; Shah, Ankit ; Gaikwad, Sheetal ; Mishra, Akassh A. ; Kohli, Simranjit Singh ; Dhage, Sudhir

Author_Institution :

Dept. of Comput. Eng., Univ. of Mumbai, Mumbai, India

fYear :

2011

fDate :

26-28 Oct. 2011

Firstpage :

408

Lastpage :

412

Abstract :

As technology advances, the demand for high-performance computing is increasing tremendously. Cluster computing has developed thanks to the availability of cost-effective high-performance processors and high-speed networks. The long-term trend in high-performance computing is toward an increasing number of nodes in parallel computing platforms, which entails a higher failure probability. The Message Passing Interface (MPI) is currently the programming paradigm and communication library most commonly used on parallel computing platforms. MPI applications may be stopped at any time by unpredictable failures during execution. In this paper we propose an efficient fault-tolerant approach for MPI systems in an asymmetric cluster computing environment. The approach uses a centralized logging process, with message logging to handle message losses. The process has three main parts: failure detection, failure recovery, and overload detection. Our system maintains monitor nodes for all nodes in the cluster, the difference being that every monitor node can also work as a cluster node while the system is functioning properly, not just at the time of a node failure.
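
The three parts named in the abstract (failure detection, failure recovery, overload detection) together with centralized message logging can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no algorithmic detail, so the `Monitor` class, its method names, the heartbeat timeout, and the load threshold are all hypothetical choices made for illustration. Time is passed in explicitly so the example does not depend on a real clock.

```python
from dataclasses import dataclass, field


@dataclass
class Monitor:
    """Hypothetical central monitor: heartbeats, load reports, and a message log."""
    timeout: float = 3.0    # assumed: seconds of silence before a node is presumed failed
    max_load: float = 0.9   # assumed: load fraction above which a node is overloaded
    last_seen: dict = field(default_factory=dict)  # node -> time of last heartbeat
    load: dict = field(default_factory=dict)       # node -> last reported load
    log: list = field(default_factory=list)        # (seq, src, dst, payload) tuples
    acked: dict = field(default_factory=dict)      # dst -> set of acknowledged seq numbers

    def heartbeat(self, node, now, load):
        """Record that `node` was alive at time `now` with the given load."""
        self.last_seen[node] = now
        self.load[node] = load

    def failed(self, now):
        """Failure detection: nodes whose last heartbeat is older than the timeout."""
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)

    def overloaded(self, now):
        """Overload detection: live nodes reporting a load above the threshold."""
        dead = set(self.failed(now))
        return sorted(n for n, l in self.load.items()
                      if n not in dead and l > self.max_load)

    def log_message(self, src, dst, payload):
        """Centralized logging: store the message before it is delivered."""
        seq = len(self.log)
        self.log.append((seq, src, dst, payload))
        return seq

    def ack(self, dst, seq):
        """Mark message `seq` as received by `dst`."""
        self.acked.setdefault(dst, set()).add(seq)

    def replay_for(self, dst):
        """Failure recovery: logged messages for `dst` that were never acknowledged."""
        done = self.acked.get(dst, set())
        return [(s, src, p) for s, src, d, p in self.log
                if d == dst and s not in done]


if __name__ == "__main__":
    m = Monitor()
    m.heartbeat("n1", now=0.0, load=0.2)
    m.heartbeat("n2", now=4.0, load=0.95)
    print(m.failed(now=5.0), m.overloaded(now=5.0))
```

In this sketch a restarted or replacement node would call `replay_for` to obtain the messages it missed; how the paper actually reassigns work to monitor nodes is not stated in the abstract and is left out here.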

Keywords :

application program interfaces; fault tolerant computing; message passing; parallel processing; system recovery; workstation clusters; MPI applications; asymmetric cluster computing environment; centralized logging process; cluster computing system; communication library; failure probability; failure recovery; fault tolerant approach; high performance computing; message logging; message passing paradigm; parallel computing platforms; programming paradigm; Fault tolerance; Fault tolerant systems; Heart beat; Load management; Message passing; Monitoring; Servers; asymmetric cluster; failure recovery; fault tolerance

fLanguage :

English

Publisher :

IEEE

Conference_Title :

P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), 2011 International Conference on

Conference_Location :

Barcelona

Print_ISBN :

978-1-4577-1448-1

Type :

conf

DOI :

10.1109/3PGCIC.2011.77

Filename :

6154915

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3361095