Regional Information Center for Science and Technology - Peer-to-peer fault tolerance framework for a grid computing system

DocumentCode :

2896415

Title :

Peer-to-peer fault tolerance framework for a grid computing system

Author :

Tangmankhong, Thagorn ; Siripongwutikorn, Peerapon ; Achalakul, Tiranee

Author_Institution :

Dept. of Comput. Eng., King Mongkut's Univ. of Technol. Thonburi, Thonburi, Thailand

fYear :

2012

fDate :

May 30 2012-June 1 2012

Firstpage :

379

Lastpage :

384

Abstract :

A grid computing system provides high-performance computing power, large storage space, or high communication bandwidth to suit user requirements. A major concern in a grid computing system is reliability, since the failure of a single node causes every application running on that node to fail. We propose a fault-tolerance framework to improve the reliability of a grid system. The framework is novel in that it uses a peer-to-peer replication model instead of the traditional client-server replication model, which reduces the replication time overhead and provides a better degree of resiliency. Essentially, the checkpoint data file is split into chunks that are distributed in parallel among a number of backup peers such that each chunk is replicated at two backup nodes. Moreover, the backup survives, with its data redundancy maintained, if any one of the backup nodes in the group fails. Detailed algorithms for the modules of the complete framework are provided, including group forming, fault detection, replication, and fault recovery. The replication time of the proposed peer-to-peer model is compared with that of the client-server model by simulation over a wide range of chunk sizes and checkpoint data sizes. Our results show that, for a large enough chunk size, the replication time of the peer-to-peer replication model is reduced by half compared with that of the client-server model.
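
The chunked replication described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm, which this record does not reproduce. The round-robin placement rule, the in-memory peer stores, and the function names (split_into_chunks, assign_replicas, replicate, recover) are all assumptions made for illustration. The sketch only shows the stated property: each chunk is held by two distinct backup peers, so the checkpoint can still be rebuilt after any single backup peer fails.

```python
from typing import Dict, List


def split_into_chunks(data: bytes, chunk_size: int) -> List[bytes]:
    """Split checkpoint data into fixed-size chunks (the last may be shorter)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def assign_replicas(num_chunks: int, peers: List[str]) -> Dict[int, List[str]]:
    """Map each chunk index to two distinct backup peers.

    Assumption (not from the paper): round-robin placement, where chunk i
    is stored on peer i and peer i + 1, both taken modulo the peer count.
    """
    if len(peers) < 2:
        raise ValueError("need at least two backup peers")
    n = len(peers)
    return {i: [peers[i % n], peers[(i + 1) % n]] for i in range(num_chunks)}


def replicate(data: bytes, chunk_size: int,
              peers: List[str]) -> Dict[str, Dict[int, bytes]]:
    """Return hypothetical per-peer stores holding the replicated chunks."""
    chunks = split_into_chunks(data, chunk_size)
    placement = assign_replicas(len(chunks), peers)
    stores: Dict[str, Dict[int, bytes]] = {p: {} for p in peers}
    for index, chunk in enumerate(chunks):
        for peer in placement[index]:
            stores[peer][index] = chunk
    return stores


def recover(stores: Dict[str, Dict[int, bytes]], num_chunks: int,
            failed_peer: str) -> bytes:
    """Rebuild the checkpoint from every peer except the failed one.

    Raises KeyError if some chunk has no surviving replica.
    """
    surviving: Dict[int, bytes] = {}
    for peer, held in stores.items():
        if peer != failed_peer:
            surviving.update(held)
    return b"".join(surviving[i] for i in range(num_chunks))


if __name__ == "__main__":
    checkpoint = bytes(range(10))
    backup_peers = ["peer0", "peer1", "peer2"]
    peer_stores = replicate(checkpoint, 3, backup_peers)
    chunk_count = len(split_into_chunks(checkpoint, 3))
    for failed in backup_peers:
        assert recover(peer_stores, chunk_count, failed) == checkpoint
```

In a real deployment the chunks would be sent to the backup peers over the network in parallel, which is where the abstract's reported saving in replication time would come from. The sketch models placement and recovery only and says nothing about timing.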

Keywords :

computer network performance evaluation; computer network reliability; fault tolerant computing; grid computing; peer-to-peer computing; P2P computing; backup nodes; backup peers; checkpoint data file size; chunk sizes; communication bandwidth; fault detection; fault recovery; grid computing system; grid system reliability improvement; group-forming; high-performance computing; node failure; peer-to-peer fault tolerance framework; peer-to-peer replication model; replication time overhead; replication time performance evaluation; resiliency degree; storage space; Computational modeling; Fault detection; Fault tolerance; Fault tolerant systems; Peer to peer computing; Throughput; Fault tolerance; Grid reliability; P2P backup;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Computer Science and Software Engineering (JCSSE), 2012 International Joint Conference on

Conference_Location :

Bangkok

Print_ISBN :

978-1-4673-1920-1

Type :

conf

DOI :

10.1109/JCSSE.2012.6261983

Filename :

6261983

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2896415