Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects?

Author

Islam, Nusrat Sharmin ; Lu, Xiaoyi ; Wasi-ur-Rahman, Md ; Panda, Dhabaleswar K.

Author_Institution

Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA

fYear

2013

fDate

21-23 Aug. 2013

Firstpage

Lastpage

Abstract

The Hadoop Distributed File System (HDFS) is a popular choice for Big Data applications due to its reliability and fault-tolerance. HDFS provides fault-tolerance and availability guarantees by replicating each data block to multiple DataNodes. The current implementation of HDFS in Apache Hadoop performs replication in a pipelined fashion, resulting in higher replication times. Such large replication times adversely impact the performance of real-time, latency-sensitive applications. In this paper, we propose an alternative parallel replication scheme applicable to both the socket-based design of HDFS and the RDMA-based design of HDFS over InfiniBand. We analyze the challenges and issues in parallel replication and compare its performance with the existing pipelined replication scheme in HDFS over 1 GigE, IPoIB (IP over InfiniBand), 10 GigE and RDMA (Remote Direct Memory Access) over InfiniBand. Experiments performed over high performance networks (IPoIB, 10 GigE, and IB) show that the proposed parallel replication scheme outperforms the default pipelined design for a variety of benchmarks. We observe up to a 16% reduction in the execution time of the TeraGen benchmark. We are also able to increase the throughput reported by the TestDFSIO benchmark by up to 12%. The proposed parallel replication also enhances HBase Put operation performance by 17%. However, for lower performance networks like 1 GigE and for smaller data sizes, parallel replication does not improve performance.

Keywords

IP networks; computer network performance evaluation; fault tolerant computing; network operating systems; peripheral interfaces; pipeline processing; replicated databases; 10 GigE; Apache Hadoop Distributed File System; HBase Put operation performance enhancement; HDFS fault-tolerance; HDFS reliability; IP over InfiniBand; IPoIB; RDMA-based HDFS design; TeraGen benchmark; TestDFSIO benchmark; availability guarantee; big-data applications; data block replication; data nodes; execution time reduction; high-performance interconnects; high-performance networks; parallel replication time; pipelined replication scheme; real-time latency-sensitive applications; remote direct memory access; socket-based HDFS design; throughput; Benchmark testing; Data handling; Data storage systems; File systems; Information management; Protocols; Throughput; Big Data; HDFS; High Performance Interconnects; Replication;

fLanguage

English

Publisher

IEEE

Conference_Title

High-Performance Interconnects (HOTI), 2013 IEEE 21st Annual Symposium on

Conference_Location

San Jose, CA

Type

conf

DOI

10.1109/HOTI.2013.24

Filename

6627739

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=3471587