DocumentCode
682151
Title
Scalable load balancing for mapreduce-based record linkage
Author
Wei Yan; Yuan Xue; Bradley Malin
Author_Institution
Dept. of Electr. Eng. & Comput. Sci., Vanderbilt Univ., Nashville, TN, USA
fYear
2013
fDate
6-8 Dec. 2013
Firstpage
1
Lastpage
10
Abstract
Recent research has introduced load balancing schemes that are aware of the input data distribution (i.e., the data profile) to mitigate data skew and fully exploit the parallel capability of the MapReduce framework in support of record linkage. However, existing solutions face a significant scalability issue when applied to massive data sets with millions or billions of blocks (the basic unit in record linkage), because their data profiles cannot be maintained both precisely and efficiently. This paper introduces a profiling method based on the notion of a sketch, which provides a compact, scalable way to maintain block size statistics. In addition, we propose two load balancing algorithms that operate over sketch-based profiles while solving the data skew problem associated with record linkage. We present an analytical evaluation and extensive experiments (using Hadoop) on real and controlled synthetic data sets to illustrate the effectiveness of our solution. The experimental results show that, on a set of DBLP data sets containing 2.5 to 50.4 million records, our load balancing algorithms reduce overall job completion time by 71.56% and 70.73% relative to Hadoop's default settings.
Keywords
data handling; resource allocation; statistics; MapReduce-based record linkage; analytical analysis; block size statistics; data skew; input data distribution; scalability issue; scalable load balancing; Algorithm design and analysis; Arrays; Couplings; Indexes; Load management; Radiation detectors; Vectors; Load Balance; MapReduce; Record Linkage; Scalability
fLanguage
English
Publisher
IEEE
Conference_Titel
Performance Computing and Communications Conference (IPCCC), 2013 IEEE 32nd International
Conference_Location
San Diego, CA
Print_ISBN
978-1-4799-3213-9
Type
conf
DOI
10.1109/PCCC.2013.6742785
Filename
6742785