DocumentCode
2589573
Title
On the optimum recovery of distributed programs
Author
Silva, Luis Moura ; Silva, João Gabriel
Author_Institution
Coimbra Univ., Portugal
fYear
1994
fDate
5-8 Sep 1994
Firstpage
704
Lastpage
711
Abstract
In a previous paper (1992), the authors presented a checkpointing algorithm for distributed applications. That algorithm is based on a non-blocking coordinated global checkpoint of the distributed program. However, the associated rollback algorithm does not provide the best results, since in most cases it forces all the processes to roll back. This paper presents two algorithms for roll-back-recovery that minimize the number of processes that need to roll back. One of the algorithms is oriented to systems that use message logging, while the other is more general and can be used in systems that rely only on a coordinated checkpoint and do not log messages. We show that our proposal achieves the optimum result in minimizing the number of processes that have to roll back.
Keywords
parallel algorithms; parallel programming; software fault tolerance; system recovery; checkpointing; distributed algorithms; distributed programs; distributed systems; fault-tolerance; message logging; optimum recovery; roll-back-recovery; vector time; Checkpointing; Computer applications; Computer crashes; Computer hacking; Concurrent computing; Delay; Distributed algorithms; Distributed computing; Fault tolerance; Proposals
fLanguage
English
Publisher
IEEE
Conference_Titel
EUROMICRO 94. System Architecture and Integration. Proceedings of the 20th EUROMICRO Conference.
Conference_Location
Liverpool
Print_ISBN
0-8186-6430-4
Type
conf
DOI
10.1109/EURMIC.1994.390340
Filename
390340