DocumentCode :
6690
Title :
Fault Tolerance in Distributed Systems Using Fused Data Structures
Author :
Balasubramanian, Balamurugan ; Garg, V.K.
Author_Institution :
Dept. of Electr. Eng., Princeton Univ., Princeton, NJ, USA
Volume :
24
Issue :
4
fYear :
2013
fDate :
April 2013
Firstpage :
701
Lastpage :
715
Abstract :
Replication is the prevalent solution for tolerating faults in large data structures hosted on distributed servers. To tolerate f crash faults (dead/unresponsive data structures) among n distinct data structures, replication requires f + 1 replicas of each data structure, resulting in nf additional backups. We present a solution, referred to as fusion, that uses a combination of erasure codes and selective replication to tolerate f crash faults using just f additional fused backups. We show that our solution achieves O(n) savings in space over replication. Further, we present a solution to tolerate f Byzantine faults (malicious data structures) that requires only nf + f backups, compared to the 2nf backups required by replication. We explore the theory of fused backups and provide a library of such backups for all the data structures in the Java Collection Framework. The theoretical and experimental evaluation confirms that fused backups are space-efficient compared to replication, while causing very little overhead for normal operation. To illustrate the practical usefulness of fusion, we use fused backups for reliability in Amazon's highly available key-value store, Dynamo. While the current replication-based solution uses 300 backup structures, we present a solution that requires only 120 backup structures. This results in savings in space as well as in other resources such as power.
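The backup counts quoted in the abstract can be checked with a short calculation. The sketch below is illustrative only: the function names and the sample values n = 10, f = 3 are assumptions chosen for the example, not taken from the paper, and the code encodes only the four formulas stated above (nf versus f for crash faults, 2nf versus nf + f for Byzantine faults).

```python
def crash_backups(n: int, f: int) -> dict:
    """Backups needed to tolerate f crash faults among n data structures,
    using the counts stated in the abstract (hypothetical helper)."""
    return {"replication": n * f, "fusion": f}


def byzantine_backups(n: int, f: int) -> dict:
    """Backups needed to tolerate f Byzantine faults among n data structures,
    using the counts stated in the abstract (hypothetical helper)."""
    return {"replication": 2 * n * f, "fusion": n * f + f}


if __name__ == "__main__":
    n, f = 10, 3  # sample values, assumed for illustration only
    print("crash:", crash_backups(n, f))
    print("byzantine:", byzantine_backups(n, f))
```

For crash faults the ratio of replication backups to fused backups is nf / f = n, which is the O(n) saving the abstract refers to, counted in number of backup structures.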
Keywords :
Java; client-server systems; computational complexity; data structures; fault tolerant computing; replicated databases; Amazon; Byzantine faults; Dynamo; Java collection framework; computational complexity; crash fault tolerance; dead data structure replication; distributed servers; distributed systems; erasure codes; fused data structure backup library; malicious data structures; unresponsive data structure replication; Arrays; Computer crashes; Fault tolerance; Fault tolerant systems; Indexes; Servers; Distributed systems; data structures; fault tolerance;
fLanguage :
English
Journal_Title :
Parallel and Distributed Systems, IEEE Transactions on
Publisher :
IEEE
ISSN :
1045-9219
Type :
jour
DOI :
10.1109/TPDS.2012.96
Filename :
6171174