  • DocumentCode
    3588749
  • Title

    RHJoin: A fast and space-efficient join method for log processing in MapReduce

  • Author

    Dixin Tang ; Taoying Liu ; Hong Liu ; Wei Li

  • Author_Institution
    Inst. of Comput. Technol., Beijing, China
  • fYear
    2014
  • Firstpage
    975
  • Lastpage
    980
  • Abstract
    Equi-join is heavily used in MapReduce-based log processing. With the rapid growth of dataset sizes, join methods on MapReduce have been studied extensively in recent years. We find that, when faced with a huge amount of log data, existing join methods usually cannot achieve high query performance and affordable storage consumption at the same time. They either optimize one aspect while significantly sacrificing the other, or are applicable only in limited cases. In this paper, after analyzing the characteristics of the workloads and of the underlying MapReduce framework, we present RHJoin (Repartition Hash Join), a join method with specific optimizations for log processing, and its implementation on Hadoop. In RHJoin, reference tables are partitioned in a pre-processing step, the log table is partitioned on the map side, and the hash join is executed on the reduce side. The shuffle procedure of MapReduce is also optimized by removing the sort step and overlapping the execution of mappers and reducers. Comprehensive experiments show that RHJoin achieves high query performance with only a small extra storage cost and is applicable to a wide range of log-processing scenarios.
  • Keywords
    data handling; parallel programming; MapReduce shuffle procedure; MapReduce-based log processing; RHJoin method; log data; query performance; repartition hash join method; storage consumption; Scalability; Big data; Join; Log Processing; MapReduce
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Systems (ICPADS), 2014 20th IEEE International Conference on
  • Type

    conf

  • DOI
    10.1109/PADSW.2014.7097918
  • Filename
    7097918
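
The abstract outlines three steps: reference tables are partitioned ahead of time, the log table is partitioned on the map side, and a hash join runs on the reduce side without a sort. The following is a minimal single-process sketch of that general repartition hash join idea, not the authors' Hadoop implementation. All function names, the CRC32-based partitioner, and the sample tables are assumptions made for illustration, and the overlapping of mappers and reducers is not modeled.

```python
import zlib
from collections import defaultdict


def partition_of(key, num_partitions):
    # Stable hash (assumed choice) so both tables route equal keys
    # to the same partition number.
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def partition_table(rows, key_index, num_partitions):
    # Split rows into buckets by the hash of their join key.
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[partition_of(row[key_index], num_partitions)].append(row)
    return parts


def hash_join_partition(ref_rows, log_rows, ref_key, log_key):
    # "Reduce side": build a hash table on the reference partition,
    # then probe it with the log rows. No sorting is involved.
    table = defaultdict(list)
    for ref in ref_rows:
        table[ref[ref_key]].append(ref)
    joined = []
    for entry in log_rows:
        for ref in table.get(entry[log_key], ()):
            joined.append(entry + ref)
    return joined


def repartition_hash_join(reference, log, ref_key, log_key, num_partitions=4):
    # Pre-processing step: partition the reference table once.
    ref_parts = partition_table(reference, ref_key, num_partitions)
    # "Map side": partition the log table with the same function.
    log_parts = partition_table(log, log_key, num_partitions)
    result = []
    for p in range(num_partitions):
        result.extend(
            hash_join_partition(ref_parts[p], log_parts[p], ref_key, log_key)
        )
    return result


# Hypothetical sample data: a small reference table and a log table.
users = [(1, "alice"), (2, "bob")]
log = [("GET /a", 1), ("GET /b", 2), ("GET /c", 1), ("GET /d", 3)]
joined = repartition_hash_join(users, log, ref_key=0, log_key=1)
```

Because both tables use the same partition function, each partition can be joined independently, which is what lets the reference-table partitioning be done once and reused across queries. Log rows with no matching reference key, such as the row with key 3 here, are dropped, as expected for an equi-join.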