Title :
A Study of Effective Replica Reconstruction Schemes at Node Deletion for HDFS
Author :
Higai, Asami ; Takefusa, Atsuko ; Nakada, Hidemoto ; Oguchi, Masato
Author_Institution :
Dept. of Inf. Sci., Ochanomizu Univ., Tokyo, Japan
Abstract :
Distributed file systems, which manage large amounts of data across multiple commodity machines, have attracted attention as management and processing systems for big data applications. A distributed file system consists of multiple data nodes and provides reliability and availability by holding multiple replicas of data. Due to system failure or maintenance, a data node may be removed from the system, and the data blocks held by the removed data node are lost. If data blocks are missing, the access load on the other data nodes that hold replicas of the lost blocks increases, and as a result the performance of data processing over the distributed file system decreases. Replica reconstruction, which reallocates the missing data blocks, is therefore an important issue for preventing such performance degradation. The Hadoop Distributed File System (HDFS) is a widely used distributed file system. In the HDFS replica reconstruction process, source and destination data nodes for replication are selected randomly. We found that this replica reconstruction scheme is inefficient because data transfer is biased. We therefore propose two more effective replica reconstruction schemes that aim to balance the workloads of the replication processes. Our proposed replication scheduling strategy assumes that nodes are arranged in a ring and that data blocks are transferred along this one-directional ring structure, minimizing the difference in the amount of data transferred by each node. Based on this strategy, we propose two replica reconstruction schemes: an optimization scheme and a heuristic scheme. We have implemented the proposed schemes in HDFS and evaluated them on an actual HDFS cluster. The experiments confirm that the replica reconstruction throughput of the proposed schemes shows a 45% improvement over that of the default scheme. We also verify that the heuristic scheme is effective: it achieves performance comparable to the optimization scheme and can be more scalable than the optimization scheme.
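To make the ring-based strategy concrete, the following is a minimal, illustrative sketch (not the paper's actual implementation; all names are hypothetical) of a heuristic scheduler in this spirit: nodes form a logical ring, each lost block is re-replicated from one of its surviving holders to that holder's successor, and sources are chosen greedily to keep per-node transfer volumes balanced.

```python
# Illustrative ring-based replica reconstruction scheduler (a sketch of the
# general idea, assuming a simplified model where each block transfer has
# equal cost and every node appears once in the ring).

def ring_schedule(nodes, lost_blocks):
    """nodes: ordered list of node ids forming the one-directional ring.
    lost_blocks: dict mapping block id -> list of nodes still holding a replica.
    Returns a list of (block, source, destination) transfer assignments."""
    succ = {n: nodes[(i + 1) % len(nodes)] for i, n in enumerate(nodes)}
    out_load = {n: 0 for n in nodes}   # blocks sent by each node
    in_load = {n: 0 for n in nodes}    # blocks received by each node
    plan = []
    for block, holders in lost_blocks.items():
        # Prefer holders whose successor does not already hold the block,
        # so the new replica lands on a node that actually needs it.
        candidates = [h for h in holders if succ[h] not in holders] or holders
        # Greedy heuristic: pick the candidate whose combined send load and
        # successor's receive load is smallest, evening out transfer volume.
        src = min(candidates, key=lambda h: out_load[h] + in_load[succ[h]])
        dst = succ[src]
        out_load[src] += 1
        in_load[dst] += 1
        plan.append((block, src, dst))
    return plan
```

Because every transfer goes from a node to its ring successor, traffic flows in one direction only, which is the property the paper's scheduling strategy exploits to minimize the difference in per-node transfer amounts.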
Keywords :
Big Data; file organisation; optimisation; HDFS; Hadoop distributed file system; access load; data transfer; heuristic scheme; node deletion; optimization scheme; replica reconstruction scheme; replication scheduling strategy; Availability; Big data; Data transfer; Distributed databases; Optimization; Structural rings; Throughput; HDFS; distributed file system; heuristic; optimization; reconstruction; replica;
Conference_Titel :
2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)
Conference_Location :
Chicago, IL
DOI :
10.1109/CCGrid.2014.31