DocumentCode :
11703
Title :
Flexible Symmetrical Global-Snapshot Algorithms for Large-Scale Distributed Systems
Author :
Jichiang Tsai
Author_Institution :
Dept. of Electr. Eng., Nat. Chung Hsing Univ., Taichung, Taiwan
Volume :
24
Issue :
3
fYear :
2013
fDate :
Mar. 2013
Firstpage :
493
Lastpage :
505
Abstract :
Most existing global-snapshot algorithms for distributed systems use control messages to coordinate the construction of a global snapshot among all processes. Because these algorithms typically assume that the underlying logical overlay topology is fully connected, the number of control messages exchanged among all processes is proportional to the square of the number of processes, which raises the likelihood of network congestion. Hence, such algorithms are neither efficient nor scalable for a large-scale distributed system composed of a huge number of processes. Recent efforts have significantly reduced the number of control messages, but at the cost of a longer response time. In this paper, we propose an efficient global-snapshot algorithm that lets every process finish its local snapshot in a given number of rounds. In particular, the algorithm allows a tradeoff between response time and message complexity. Moreover, our global-snapshot algorithm is symmetrical in the sense that identical steps are executed by every process, so it achieves better workload balance and less network congestion. Most importantly, based on our framework, we demonstrate that the minimum number of control messages required by a symmetrical global-snapshot algorithm is Ω(N log N), where N is the number of processes. Finally, we assume non-FIFO channels.
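The tradeoff between response time (rounds) and message complexity described in the abstract can be illustrated with a small counting sketch. This is not the paper's algorithm: it assumes, purely for illustration, that processes are arranged in a generalized hypercube with one dimension per round, and that in each round every process sends one control message to each peer along a single dimension. The function names `all_to_all_messages` and `round_based_messages` are hypothetical.

```python
def all_to_all_messages(n: int) -> int:
    """Control messages when each of n processes notifies every other process once."""
    return n * (n - 1)


def round_based_messages(n: int, rounds: int) -> int:
    """Illustrative message count for a symmetric, round-based exchange.

    Assumption (not taken from the paper): processes form a generalized
    hypercube with `rounds` dimensions of side b, where b is the smallest
    integer with b ** rounds >= n. In each round, every process sends one
    message to each of its b - 1 peers along one dimension.
    """
    if n < 1 or rounds < 1:
        raise ValueError("n and rounds must be positive")
    b = 1
    while b ** rounds < n:  # integer search avoids floating-point root errors
        b += 1
    return n * rounds * (b - 1)


if __name__ == "__main__":
    n = 1024
    print("all-to-all:", all_to_all_messages(n))
    for k in (1, 2, 5, 10):
        print(f"rounds={k}:", round_based_messages(n, k))
```

Under these assumptions, with N = 1024 a single round reproduces the all-to-all count of N(N - 1) = 1,047,552 messages, while ten rounds (a binary hypercube) need N log2 N = 10,240 messages. Intermediate round counts fall between the two, which is the kind of tradeoff the abstract refers to.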
Keywords :
checkpointing; computational complexity; large-scale systems; message passing; overlay networks; control message exchange; flexible symmetrical global-snapshot algorithms; large-scale distributed systems; logical overlay topology; message complexity; network congestion; nonFIFO channels; response time; workload balance; Algorithm design and analysis; Complexity theory; Hypercubes; Process control; Program processors; Time factors; Vectors; Distributed systems; checkpointing; global snapshots; message passing; process symmetry;
fLanguage :
English
Journal_Title :
Parallel and Distributed Systems, IEEE Transactions on
Publisher :
IEEE
ISSN :
1045-9219
Type :
jour
DOI :
10.1109/TPDS.2012.139
Filename :
6197182