• DocumentCode
    1914304
  • Title

    Sampling-Based Partitioning in MapReduce for Skewed Data

  • Author

    Xu, Yujie ; Zou, Peng ; Qu, Wenyu ; Li, Zhiyang ; Li, Keqiu ; Cui, Xiaoli

  • Author_Institution
    Coll. of Inf. Sci. & Technol., Dalian Maritime Univ., Dalian, China
  • fYear
    2012
  • fDate
    20-23 Sept. 2012
  • Firstpage
    1
  • Lastpage
    8
  • Abstract
    MapReduce, a popular tool for distributed and scalable processing of voluminous data, has been used in many areas. However, it is inefficient when handling skewed data, since it considers only the key and adopts a uniform hash method to distribute the workload to each reducer, ignoring the key's distribution. This can lead to load imbalance, longer processing time, and "straggler" tasks, and ultimately degrades performance. The current approach to this problem usually adopts asynchronous Map and Reduce to gather the distribution of keys' frequencies and make a partition scheme in advance, but it costs too much waiting time. In this paper, we address the problem of how to efficiently and effectively partition the intermediate keys to balance the load of each reducer when skewed data exists. We use a sampling MapReduce job to gather the distribution of keys' frequencies, estimate the overall distribution, and make a partition scheme in advance. Then, we apply it to the map phase of the expected MapReduce job. This design not only provides a load-balanced partition scheme, but also keeps the high performance of synchronous mode in MapReduce. We also propose two partition schemes based on the sampling results: cluster combination optimization and cluster partition combination. The experimental results show that the first partition scheme is suitable for data sets with lighter skew, while cluster partition combination has a greater time and load-balancing advantage when the data skew is heavier.
  • Keywords
    file organisation; parallel processing; resource allocation; sampling methods; MapReduce; asynchronous map; cluster combination optimization; cluster partition combination; data processing; key frequency distribution; load-balanced partition scheme; parallel massive data set processing; sampling-based partitioning scheme; skewed data; uniform hash method; Data models; Distributed databases; Educational institutions; Load management; Monitoring; Optimization; Standards; MapReduce; data skew; partitioning; sampling;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    ChinaGrid Annual Conference (ChinaGrid), 2012 Seventh
  • Conference_Location
    Beijing
  • Print_ISBN
    978-1-4673-2623-0
  • Electronic_ISBN
    978-0-7695-4816-6
  • Type

    conf

  • DOI
    10.1109/ChinaGrid.2012.18
  • Filename
    6337308
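
The sample-then-partition idea summarized in the abstract can be illustrated with a minimal single-process sketch. This is not the paper's cluster combination optimization or cluster partition combination scheme, whose details the abstract does not give; it swaps in a plain greedy largest-first assignment as a stand-in. The function names (`sample_key_frequencies`, `build_partition_scheme`, `partition`) and the CRC32 fallback for unsampled keys are assumptions made for illustration only.

```python
import random
import zlib
from collections import Counter


def sample_key_frequencies(keys, sample_rate, seed=0):
    """Count key frequencies over a random sample of the input (hypothetical helper)."""
    rng = random.Random(seed)
    return Counter(k for k in keys if rng.random() < sample_rate)


def build_partition_scheme(freqs, num_reducers):
    """Greedy stand-in: give each key, heaviest first, to the least-loaded reducer."""
    loads = [0] * num_reducers
    scheme = {}
    # Sort by descending frequency; break ties by key text for determinism.
    for key, count in sorted(freqs.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        target = loads.index(min(loads))
        scheme[key] = target
        loads[target] += count
    return scheme, loads


def partition(key, scheme, num_reducers):
    """Use the precomputed scheme; fall back to a uniform hash for unsampled keys."""
    if key in scheme:
        return scheme[key]
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers


if __name__ == "__main__":
    # Synthetic skewed input: one hot key dominates.
    data = ["hot"] * 6000 + ["warm"] * 2500 + [f"k{i}" for i in range(1500)]
    freqs = sample_key_frequencies(data, sample_rate=0.1)
    scheme, est_loads = build_partition_scheme(freqs, num_reducers=4)
    actual = Counter(partition(k, scheme, 4) for k in data)
    print("estimated loads from sample:", est_loads)
    print("actual records per reducer:", sorted(actual.items()))
```

In this sketch the scheme is computed once from the sample and then consulted during partitioning, mirroring the abstract's description of preparing the partition scheme in advance and applying it in the map phase. Note that a single key heavier than the average reducer load cannot be balanced by key-level assignment alone.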