• DocumentCode
    3471587
  • Title
    Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects?
  • Author
    Islam, Nusrat Sharmin ; Xiaoyi Lu ; Wasi-ur-Rahman, Md ; Panda, Dhabaleswar K.

  • Author_Institution
    Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA
  • fYear
    2013
  • fDate
    21-23 Aug. 2013
  • Firstpage
    75
  • Lastpage
    78
  • Abstract
    The Hadoop Distributed File System (HDFS) is a popular choice for Big Data applications due to its reliability and fault tolerance. HDFS provides fault-tolerance and availability guarantees by replicating each data block to multiple DataNodes. The current implementation of HDFS in Apache Hadoop performs replication in a pipelined fashion, resulting in higher replication times. Such long replication times adversely impact the performance of real-time, latency-sensitive applications. In this paper, we propose an alternative parallel replication scheme applicable to both the socket-based design of HDFS and the RDMA-based design of HDFS over InfiniBand. We analyze the challenges and issues in parallel replication and compare its performance with the existing pipelined replication scheme in HDFS over 1 GigE, IPoIB (IP over InfiniBand), 10 GigE and RDMA (Remote Direct Memory Access) over InfiniBand. Experiments performed over high performance networks (IPoIB, 10 GigE, and IB) show that the proposed parallel replication scheme is able to outperform the default pipelined design for a variety of benchmarks. We observe up to a 16% reduction in the execution time of the TeraGen benchmark. We are also able to increase the throughput reported by the TestDFSIO benchmark by up to 12%. The proposed parallel replication is also able to enhance the HBase Put operation performance by 17%. However, for lower performance networks like 1 GigE and for smaller data sizes, parallel replication does not improve performance.
  • Keywords
    IP networks; computer network performance evaluation; fault tolerant computing; network operating systems; peripheral interfaces; pipeline processing; replicated databases; 10 GigE; Apache Hadoop Distributed File System; HBase Put operation performance enhancement; HDFS fault-tolerance; HDFS reliability; IP over InfiniBand; IPoIB; RDMA-based HDFS design; TeraGen benchmark; TestDFSIO benchmark; availability guarantee; big-data applications; data block replication; data nodes; execution time reduction; high-performance interconnects; high-performance networks; parallel replication time; pipelined replication scheme; real-time latency-sensitive applications; remote direct memory access; socket-based HDFS design; throughput; Benchmark testing; Data handling; Data storage systems; File systems; Information management; Protocols; Throughput; Big Data; HDFS; High Performance Interconnects; Replication;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    High-Performance Interconnects (HOTI), 2013 IEEE 21st Annual Symposium on
  • Conference_Location
    San Jose, CA
  • Type
    conf
  • DOI
    10.1109/HOTI.2013.24
  • Filename
    6627739