Title :
Peer-to-peer fault tolerance framework for a grid computing system
Author :
Tangmankhong, Thagorn ; Siripongwutikorn, Peerapon ; Achalakul, Tiranee
Author_Institution :
Dept. of Comput. Eng., King Mongkut's Univ. of Technol. Thonburi, Thonburi, Thailand
fDate :
May 30 2012-June 1 2012
Abstract :
A grid computing system provides high-performance computing power, large storage space, or high communication bandwidth to suit user requirements. The major concern in a grid computing system is reliability, because a single node failure causes all applications running on that node to fail. We propose a fault-tolerance framework to improve the reliability of a grid system. The framework is novel in that it uses a peer-to-peer replication model instead of the traditional client-server replication model, which reduces the replication time overhead and provides a better degree of resiliency. Essentially, the checkpoint data file is split into chunks that are distributed among a number of backup peers in parallel, such that each chunk is replicated at two backup nodes. Moreover, the backup survives, and its data redundancy is maintained, if any one of the backup nodes in the group fails. Detailed algorithms are provided for the modules of the complete framework, including group forming, fault detection, replication, and fault recovery. The replication time of the proposed peer-to-peer model is compared with that of the client-server model by simulation over a wide range of chunk sizes and checkpoint data sizes. Our results show that, for a large enough chunk size, the replication time of the peer-to-peer replication model is reduced by half compared to that of the client-server model.
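The chunked replication described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the round-robin placement, the in-memory peer stores, and all function and peer names (split_into_chunks, place_replicas, replicate, recover, "b0", "b1", ...) are assumptions made for illustration only. It shows a checkpoint split into chunks, each chunk stored on two distinct backup peers, and the checkpoint rebuilt after any single backup peer is lost.

```python
from typing import Dict, List, Optional, Tuple


def split_into_chunks(data: bytes, chunk_size: int) -> List[bytes]:
    """Split checkpoint data into fixed-size chunks (the last may be shorter)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_replicas(num_chunks: int, peers: List[str]) -> Dict[int, Tuple[str, str]]:
    """Assign every chunk to two distinct backup peers.

    Round-robin placement is an assumption of this sketch,
    not a detail taken from the paper.
    """
    if len(peers) < 2:
        raise ValueError("need at least two backup peers")
    n = len(peers)
    return {i: (peers[i % n], peers[(i + 1) % n]) for i in range(num_chunks)}


def replicate(data: bytes, chunk_size: int,
              peers: List[str]) -> Tuple[Dict[str, Dict[int, bytes]], int]:
    """Store each chunk on its two assigned peers (simulated in memory)."""
    chunks = split_into_chunks(data, chunk_size)
    placement = place_replicas(len(chunks), peers)
    stores: Dict[str, Dict[int, bytes]] = {p: {} for p in peers}
    for index, chunk in enumerate(chunks):
        for peer in placement[index]:
            stores[peer][index] = chunk
    return stores, len(chunks)


def recover(stores: Dict[str, Dict[int, bytes]], num_chunks: int,
            failed_peer: Optional[str] = None) -> bytes:
    """Rebuild the checkpoint from the peers that are still alive."""
    available: Dict[int, bytes] = {}
    for peer, held in stores.items():
        if peer != failed_peer:
            available.update(held)
    missing = [i for i in range(num_chunks) if i not in available]
    if missing:
        raise RuntimeError(f"chunks lost: {missing}")
    return b"".join(available[i] for i in range(num_chunks))
```

With three hypothetical peers, calling replicate(data, 4, ["b0", "b1", "b2"]) and then recover(stores, num_chunks, failed_peer="b1") returns the original data, because every chunk still has one surviving replica.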
Keywords :
computer network performance evaluation; computer network reliability; fault tolerant computing; grid computing; peer-to-peer computing; P2P computing; backup nodes; backup peers; checkpoint data file size; chunk sizes; communication bandwidth; fault detection; fault recovery; grid computing system; grid system reliability improvement; group-forming; high-performance computing; node failure; peer-to-peer fault tolerance framework; peer-to-peer replication model; replication time overhead; replication time performance evaluation; resiliency degree; storage space; Computational modeling; Fault detection; Fault tolerance; Fault tolerant systems; Peer to peer computing; Throughput; Grid reliability; P2P backup
Conference_Title :
Computer Science and Software Engineering (JCSSE), 2012 International Joint Conference on
Conference_Location :
Bangkok
Print_ISBN :
978-1-4673-1920-1
DOI :
10.1109/JCSSE.2012.6261983