Regional Information Center for Science and Technology (RICeST) - Efficient rollback-recovery technique in distributed computing systems

DocumentCode :

1057922

Title :

Efficient rollback-recovery technique in distributed computing systems

Author :

Chiu, Ge-Ming ; Young, Cheng-Ru

Author_Institution :

Dept. of Electr. Eng. & Technol., Nat. Taiwan Inst. of Technol., Taipei, Taiwan

Volume :

Issue :

fYear :

1996

fDate :

6/1/1996

Firstpage :

565

Lastpage :

577

Abstract :

We propose an approach for implementing rollback recovery in a distributed computing system. The concept of a logical ring is introduced for the maintenance of information required for consistent recovery from a system crash. The message processing order of a process is kept by all other processes on its logical ring. Transmission of data messages is accompanied by the circulation of the associated order messages on the ring. The order messages are small. In addition, redundant transmission of order information is avoided, thereby reducing the communication overhead incurred during failure-free operation. Furthermore, updating of the order information and the garbage collection task are simplified in the proposed mechanism. Our approach does not require information about message processing order to be written to stable storage; in fact, the time-consuming operations of saving information in stable storage are confined to the checkpointing activities. When failures occur, a surviving process need roll back only if some preceding order information is totally lost, which is relatively unlikely considering the ever-growing speed of communication networks. It is shown that a system can recover correctly as long as there exists at least one surviving process.
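
The idea in the abstract, that a process's message processing order is held in the volatile memory of the other processes on its logical ring, can be illustrated with a toy in-memory simulation. This is only a sketch under stated assumptions, not the authors' protocol: it assumes a single ring shared by all processes, and the names `LogicalRing`, `deliver`, `crash`, and `recover_order` are hypothetical. Checkpointing, garbage collection, and real message passing are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    pid: int
    # order_log[p] lists message ids in the order process p handled them.
    # It lives in volatile memory only; nothing is written to stable storage.
    order_log: dict = field(default_factory=dict)
    alive: bool = True


class LogicalRing:
    """Toy model (assumption): one ring that contains every process."""

    def __init__(self, pids):
        self.ring = list(pids)
        self.procs = {pid: Process(pid) for pid in self.ring}

    def deliver(self, receiver, msg_id):
        """Receiver handles msg_id; a small order message then circulates."""
        if not self.procs[receiver].alive:
            raise RuntimeError(f"process {receiver} is down")
        start = self.ring.index(receiver)
        # Visit the receiver first, then each successor once around the ring.
        for step in range(len(self.ring)):
            member = self.procs[self.ring[(start + step) % len(self.ring)]]
            if member.alive:
                member.order_log.setdefault(receiver, []).append(msg_id)

    def crash(self, pid):
        """Simulate a crash: the process loses all volatile order information."""
        proc = self.procs[pid]
        proc.alive = False
        proc.order_log = {}

    def recover_order(self, pid):
        """Return pid's processing order from any survivor, or None if lost."""
        for member in self.procs.values():
            if member.alive and pid in member.order_log:
                return list(member.order_log[pid])
        return None
```

In this sketch, the order information for a crashed process stays recoverable while at least one ring member that recorded it survives, and `recover_order` returns `None` only when every copy is gone, which corresponds to the case where the abstract says rollback becomes necessary.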

Keywords :

distributed processing; fault tolerant computing; message passing; system recovery; communication networks; consistent recovery; data messages; distributed computing systems; failure free operation; garbage collection task; logical ring; message processing order; preceding order information; rollback recovery technique; surviving process; system crash; Checkpointing; Communication networks; Computer Society; Computer crashes; Computer networks; Distributed computing; Electronic switching systems; Fault tolerance; Fault tolerant systems; Very large scale integration;

fLanguage :

English

Journal_Title :

Parallel and Distributed Systems, IEEE Transactions on

Publisher :

IEEE

ISSN :

1045-9219

Type :

jour

DOI :

10.1109/71.506695

Filename :

506695

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1057922