DocumentCode :
1666622
Title :
Stable checkpointing in distributed systems without shared disks
Author :
Sobe, Peter
Author_Institution :
Inst. of Comput. Eng., Lubeck Univ., Germany
fYear :
2003
Abstract :
Interacting processes in distributed systems save their checkpoints on local disks for efficiency reasons. However, because local checkpoints become unavailable when their hosts fail, redundancy schemes similar to RAID-like storage schemes have to be used. In such systems, checkpoints are stable under a particular fault model because they can be reconstructed within the distributed system. In this paper, two variants of stable checkpoint storage are compared: (a) parity grouping over local checkpoints and (b) RAID-like distribution of each checkpoint using a software-based distributed storage system. An analysis is given to compare the costs of collective checkpoint creation, recovery of a single process, and rollback of all processes. The results show that, despite differences in detail, checkpointing using a distributed storage system is a reasonable solution.
Keywords :
RAID; distributed processing; system recovery; RAID-like storage schemes; distributed systems; interacting processes; parity grouping; redundancy schemes; software based distributed storage system; stable checkpointing; Checkpointing; Computer crashes; Costs; Delay; Distributed computing; Fault tolerance; Protocols; Redundancy; Workstations; Writing;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel and Distributed Processing Symposium, 2003. Proceedings. International
ISSN :
1530-2075
Print_ISBN :
0-7695-1926-1
Type :
conf
DOI :
10.1109/IPDPS.2003.1213392
Filename :
1213392