• DocumentCode
    6690
  • Title

    Fault Tolerance in Distributed Systems Using Fused Data Structures

  • Author

    Balasubramanian, Balamurugan ; Garg, V.K.

  • Author_Institution
    Dept. of Electr. Eng., Princeton Univ., Princeton, NJ, USA
  • Volume
    24
  • Issue
    4
  • fYear
    2013
  • fDate
    Apr-13
  • Firstpage
    701
  • Lastpage
    715
  • Abstract
    Replication is the prevalent solution to tolerate faults in large data structures hosted on distributed servers. To tolerate f crash faults (dead/unresponsive data structures) among n distinct data structures, replication requires f + 1 replicas of each data structure, resulting in nf additional backups. We present a solution, referred to as fusion, that uses a combination of erasure codes and selective replication to tolerate f crash faults using just f additional fused backups. We show that our solution achieves O(n) savings in space over replication. Further, we present a solution to tolerate f Byzantine faults (malicious data structures) that requires only nf + f backups as compared to the 2nf backups required by replication. We explore the theory of fused backups and provide a library of such backups for all the data structures in the Java Collection Framework. The theoretical and experimental evaluation confirms that the fused backups are space-efficient as compared to replication, while they cause very little overhead for normal operation. To illustrate the practical usefulness of fusion, we use fused backups for reliability in Amazon's highly available key-value store, Dynamo. While the current replication-based solution uses 300 backup structures, we present a solution that only requires 120 backup structures. This results in savings in space as well as other resources such as power.
  • Keywords
    Java; client-server systems; computational complexity; data structures; fault tolerant computing; replicated databases; Amazon; Byzantine faults; Dynamo; Java collection framework; computational complexity; crash fault tolerance; dead data structure replication; distributed servers; distributed systems; erasure codes; fused data structure backup library; malicious data structures; unresponsive data structure replication; Arrays; Computer crashes; Fault tolerance; Fault tolerant systems; Indexes; Servers; Distributed systems; data structures; fault tolerance
  • fLanguage
    English
  • Journal_Title
    Parallel and Distributed Systems, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1045-9219
  • Type

    jour

  • DOI
    10.1109/TPDS.2012.96
  • Filename
    6171174