DocumentCode
2593138
Title
A fault tolerant approach in cluster computing system
Author
Shwe, Thanda ; Aye, Win
Author_Institution
Dept. of Inf. Technol., Mandalay Technol. Univ., Mandalay
Volume
1
fYear
2008
fDate
14-17 May 2008
Firstpage
149
Lastpage
152
Abstract
A long-term trend in high performance computing is the increasing number of nodes in parallel computing platforms, which entails a higher failure probability. Hence, fault tolerance becomes a key property for parallel applications running on parallel computing systems. The Message Passing Interface (MPI) is currently the programming paradigm and communication library most commonly used on parallel computing platforms. MPI applications may be stopped at any time during their execution by an unpredictable failure. To avoid a complete restart of an MPI application because of a single failure, a fault tolerant MPI implementation is essential. In this paper, we propose a fault tolerant approach for cluster computing systems. Our approach is based on reassigning tasks to the rest of the system, with message logging used to handle message losses. The system consists of two main parts: failure diagnosis and failure recovery. Failure diagnosis is the detection of a failure, and failure recovery is the action needed to take over the workload of a failed component. This fault tolerant approach is implemented as an extension of MPI.
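The two parts named in the abstract, failure diagnosis and failure recovery, can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not the authors' implementation and not an MPI API: the `Cluster` class, the heartbeat-timeout detection rule, the least-loaded reassignment policy, and all method names are hypothetical choices made for illustration, since the abstract does not specify them.

```python
from collections import defaultdict


class Cluster:
    """Toy model (hypothetical): workers hold tasks, and messages are logged per task."""

    def __init__(self, workers):
        self.alive = set(workers)
        self.tasks = defaultdict(list)        # worker -> task ids assigned to it
        self.message_log = defaultdict(list)  # task id -> messages sent to that task
        self.last_heartbeat = {w: 0.0 for w in workers}

    def assign(self, task, worker):
        self.tasks[worker].append(task)

    def send(self, task, payload):
        # Message logging: record each message so it can be replayed after a failure.
        self.message_log[task].append(payload)

    def heartbeat(self, worker, now):
        self.last_heartbeat[worker] = now

    def diagnose(self, now, timeout):
        # Failure diagnosis (assumed rule): a worker whose last heartbeat is
        # older than `timeout` is reported as failed.
        return sorted(w for w in self.alive if now - self.last_heartbeat[w] > timeout)

    def recover(self, failed):
        # Failure recovery (assumed policy): move each task of the failed worker
        # to the currently least-loaded surviving worker and return the logged
        # messages that must be replayed to the new owner.
        self.alive.discard(failed)
        if not self.alive:
            raise RuntimeError("no surviving workers to take over the workload")
        replayed = {}
        for task in self.tasks.pop(failed, []):
            target = min(sorted(self.alive), key=lambda w: len(self.tasks[w]))
            self.tasks[target].append(task)
            replayed[task] = (target, list(self.message_log[task]))
        return replayed


# Usage: n1 stops sending heartbeats, is diagnosed as failed, and its tasks move.
cluster = Cluster(["n0", "n1", "n2"])
cluster.assign("t0", "n0")
cluster.assign("t1", "n1")
cluster.assign("t2", "n1")
cluster.assign("t3", "n2")
cluster.send("t1", "a")
cluster.send("t1", "b")
cluster.send("t2", "c")
cluster.heartbeat("n0", now=10.0)
cluster.heartbeat("n2", now=10.0)

failed = cluster.diagnose(now=10.0, timeout=5.0)
replayed = cluster.recover(failed[0])
```

In this sketch the log is keyed by task rather than by worker, so replay does not depend on which node originally hosted the task; a real MPI extension would have to decide where the log lives and how it survives the failure.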
Keywords
fault tolerant computing; message passing; parallel processing; probability; program diagnostics; software libraries; system recovery; workstation clusters; cluster computing system; communication library; failure diagnosis; failure probability; failure recovery; fault tolerant approach; high performance computing; message logging; message passing interface; parallel computing platforms; Application software; Clustering algorithms; Concurrent computing; Fault tolerance; Fault tolerant systems; Hardware; High performance computing; Message passing; Parallel processing; Parallel programming;
fLanguage
English
Publisher
IEEE
Conference_Titel
Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, 2008. ECTI-CON 2008. 5th International Conference on
Conference_Location
Krabi
Print_ISBN
978-1-4244-2101-5
Electronic_ISBN
978-1-4244-2102-2
Type
conf
DOI
10.1109/ECTICON.2008.4600394
Filename
4600394