• DocumentCode
    2593138
  • Title
    A fault tolerant approach in cluster computing system
  • Author
    Shwe, Thanda ; Aye, Win
  • Author_Institution
    Dept. of Inf. Technol., Mandalay Technol. Univ., Mandalay
  • Volume
    1
  • fYear
    2008
  • fDate
    14-17 May 2008
  • Firstpage
    149
  • Lastpage
    152
  • Abstract
    A long-term trend in high performance computing is the increasing number of nodes in parallel computing platforms, which entails a higher failure probability. Hence, fault tolerance becomes a key property for parallel applications running on parallel computing systems. The message passing interface (MPI) is currently the programming paradigm and communication library most commonly used on parallel computing platforms. MPI applications may be stopped at any time during their execution due to an unpredictable failure. To avoid a complete restart of an MPI application because of a single failure, a fault tolerant MPI implementation is essential. In this paper, we propose a fault tolerant approach for cluster computing systems. Our approach is based on reassignment of tasks to the remaining system, and message logging is used to handle message losses. The system consists of two main parts: failure diagnosis and failure recovery. Failure diagnosis is the detection of a failure, and failure recovery is the action needed to take over the workload of a failed component. This fault tolerant approach is implemented as an extension of the message passing interface.
  • Keywords
    fault tolerant computing; message passing; parallel processing; probability; program diagnostics; software libraries; system recovery; workstation clusters; cluster computing system; communication library; failure diagnosis; failure probability; failure recovery; fault tolerant approach; high performance computing; message logging; message passing interface; parallel computing platforms; Application software; Clustering algorithms; Concurrent computing; Fault tolerance; Fault tolerant systems; Hardware; High performance computing; Message passing; Parallel processing; Parallel programming;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, 2008. ECTI-CON 2008. 5th International Conference on
  • Conference_Location
    Krabi
  • Print_ISBN
    978-1-4244-2101-5
  • Electronic_ISBN
    978-1-4244-2102-2
  • Type
    conf
  • DOI
    10.1109/ECTICON.2008.4600394
  • Filename
    4600394