Sampling-Based Partitioning in MapReduce for Skewed Data

Author

Xu, Yujie ; Zou, Peng ; Qu, Wenyu ; Li, Zhiyang ; Li, Keqiu ; Cui, Xiaoli

Author_Institution

Coll. of Inf. Sci. & Technol., Dalian Maritime Univ., Dalian, China

fYear

2012

fDate

20-23 Sept. 2012

Firstpage

1

Lastpage

8

Abstract

MapReduce, a popular tool for distributed and scalable processing of voluminous data, has been used in many areas. However, it is not efficient when handling skewed data, since it considers only the key and adopts a uniform hash method to distribute the workload to each reducer, while ignoring the key's distribution. This can lead to load imbalance, increase the processing time, and generate "stragglers", with the final result being performance degradation. The current approach to this problem usually adopts asynchronous Map and Reduce to gather the distribution of the keys' frequencies and make a partition scheme in advance, but it costs too much waiting time. In this paper, we address the problem of how to efficiently and effectively partition the intermediate keys to balance the load of each reducer when skewed data exists. We use a sampling MapReduce job to gather the distribution of the keys' frequencies, estimate the overall distribution, and make a partition scheme in advance. Then, we apply it to the map phase of the expected MapReduce job. This design not only provides a load-balanced partition scheme, but also keeps the high performance of synchronous mode in MapReduce. We also propose two partition schemes based on the sampling results: cluster combination optimization and cluster partition combination. The experimental results show that the first partition scheme is suitable for data sets that have a lighter skew, while cluster partition combination has a greater time and load balancing advantage when the data skew is heavier.
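
The general idea of sampling key frequencies and building a partition scheme before the main job can be illustrated with a minimal Python sketch. It is not the paper's cluster combination optimization or cluster partition combination scheme; it substitutes a simple greedy "heaviest key to least-loaded reducer" heuristic, and the function names (sample_key_frequencies, build_partition, partition) are hypothetical, chosen only for illustration.

```python
import random
from collections import Counter


def sample_key_frequencies(keys, sample_rate, seed=0):
    """Estimate key frequencies from a random sample of the input keys."""
    rng = random.Random(seed)
    # Keep each key independently with probability sample_rate.
    return Counter(k for k in keys if rng.random() < sample_rate)


def build_partition(freqs, num_reducers):
    """Greedy heuristic (an assumption, not the paper's scheme):
    assign keys, heaviest first, to the currently least-loaded reducer."""
    loads = [0] * num_reducers
    assignment = {}
    # Sort by descending frequency; break ties by key text for determinism.
    for key, count in sorted(freqs.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        target = loads.index(min(loads))
        assignment[key] = target
        loads[target] += count
    return assignment, loads


def partition(key, assignment, num_reducers):
    """Route a key in the map phase; keys unseen in the sample fall back to hashing."""
    if key in assignment:
        return assignment[key]
    return hash(key) % num_reducers
```

In this sketch the sampled counts stand in for the estimated overall distribution, and the resulting assignment table plays the role of the partition scheme that is handed to the map phase in place of the uniform hash.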

Keywords

file organisation; parallel processing; resource allocation; sampling methods; MapReduce; asynchronous map; cluster combination optimization; cluster partition combination; data processing; key frequency distribution; load-balanced partition scheme; parallel massive data set processing; sampling-based partitioning scheme; skewed data; uniform hash method; Data models; Distributed databases; Educational institutions; Load management; Monitoring; Optimization; Standards; MapReduce; data skew; partitioning; sampling;

fLanguage

English

Publisher

IEEE

Conference_Titel

ChinaGrid Annual Conference (ChinaGrid), 2012 Seventh

Conference_Location

Beijing

Print_ISBN

978-1-4673-2623-0

Electronic_ISBN

978-0-7695-4816-6

Type

conf

DOI

10.1109/ChinaGrid.2012.18

Filename

6337308