• DocumentCode
    3077036
  • Title
    Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture
  • Author
    Islam, Nusrat Sharmin ; Lu, Xiaoyi ; Wasi-ur-Rahman, M. ; Shankar, Dipti ; Panda, Dhabaleswar K.
  • fYear
    2015
  • fDate
    4-7 May 2015
  • Firstpage
    101
  • Lastpage
    110
  • Abstract
    HDFS (Hadoop Distributed File System) is the primary storage of Hadoop. Even though the data locality offered by HDFS is important for Big Data applications, HDFS suffers from huge I/O bottlenecks due to its tri-replicated data blocks and cannot efficiently utilize the storage devices available in an HPC (High Performance Computing) cluster. Moreover, because local storage space is limited, it is challenging to deploy HDFS in HPC environments. In this paper, we present a hybrid design (Triple-H) that can minimize the I/O bottlenecks in HDFS and ensure efficient utilization of the heterogeneous storage devices (e.g., RAM, SSD, and HDD) available on HPC clusters. We also propose effective data placement policies to speed up Triple-H. Our design, integrated with a parallel file system (e.g., Lustre), can lead to significant storage space savings and guarantee fault tolerance. Performance evaluations show that Triple-H can improve the write and read throughputs of HDFS by up to 7x and 2x, respectively. The execution times of data generation benchmarks are reduced by up to 3x. Our design also improves the execution time of the Sort benchmark by up to 40% over default HDFS and 54% over Lustre. The alignment phase of the CloudBurst application is accelerated by 19%. Triple-H also benefits the performance of SequenceCount and Grep in PUMA [15] over both default HDFS and Lustre.
  • Keywords
    Big Data; distributed databases; fault tolerant computing; parallel processing; CloudBurst application; Grep; HDFS; HPC clusters; Hadoop distributed file system; IO bottlenecks; Lustre; PUMA; SequenceCount; Triple-H; big data applications; data locality; data placement policies; fault-tolerance; heterogeneous storage architecture; high performance computing; parallel file system; sort benchmark; trireplicated data blocks; Engines; Fault tolerance; Fault tolerant systems; File systems; Performance evaluation; Random access memory; Servers; HPC; Heterogeneous Storage
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on
  • Conference_Location
    Shenzhen
  • Type
    conf
  • DOI
    10.1109/CCGrid.2015.161
  • Filename
    7152476