DocumentCode :
1656426
Title :
Intelligent Similarity Joins for Big Data Integration
Author :
Mian Wang ; Tiezheng Nie ; Derong Shen ; Yue Kou ; Ge Yu
Author_Institution :
Coll. of Inf. Sci. & Eng., Northeastern Univ., Shenyang, China
fYear :
2013
Firstpage :
383
Lastpage :
388
Abstract :
With the increasing amount of data, record linkage has become a challenge for big data integration. Similarity join is an efficient approach to record linkage, but it is difficult to achieve in a single-node environment. In this paper, we propose a MapReduce-based framework for set similarity join. The framework's techniques improve efficiency in two respects: reducing candidate pairs and balancing load. To reduce candidate pairs, we propose algorithms that combine multiple filtering principles: a length filter, a prefix filter, and a position filter. The load-balancing techniques address skewed data and decrease the replication transfer volume. Experimental results on a real dataset show that our approaches achieve a speed-up over previous algorithms on big data.
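The three filters named in the abstract can be illustrated with a minimal single-node sketch. This is not the paper's MapReduce framework or its algorithms; it assumes Jaccard similarity with threshold `t`, the textbook forms of the length, prefix, and position filters, and a global token order by ascending frequency. All function names (`prefix_len`, `passes_filters`, `similarity_join`) are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

EPS = 1e-9  # guards math.ceil against floating-point noise


def jaccard(x, y):
    """Exact Jaccard similarity of two token collections."""
    sx, sy = set(x), set(y)
    return len(sx & sy) / len(sx | sy)


def prefix_len(size, t):
    """Prefix length for Jaccard threshold t: |x| - ceil(t * |x|) + 1."""
    return size - math.ceil(t * size - EPS) + 1


def passes_filters(x, y, t):
    """x and y are duplicate-free token lists sorted by one global order.

    Returns False only when the pair cannot reach Jaccard >= t.
    """
    # Length filter: J(x, y) >= t implies t * |x| <= |y| <= |x| / t.
    if len(y) < t * len(x) - EPS or len(x) < t * len(y) - EPS:
        return False
    # Minimum overlap required for J(x, y) >= t.
    alpha = math.ceil(t / (1 + t) * (len(x) + len(y)) - EPS)
    px = x[:prefix_len(len(x), t)]
    py = y[:prefix_len(len(y), t)]
    pos_y = {tok: j for j, tok in enumerate(py)}
    for i, tok in enumerate(px):
        j = pos_y.get(tok)
        if j is None:
            continue
        # Position filter: tok is the first shared token, so the overlap is
        # at most 1 plus the number of tokens remaining after it on each side.
        upper = 1 + min(len(x) - i - 1, len(y) - j - 1)
        return upper >= alpha
    # Prefix filter: the two prefixes share no token, so the pair is pruned.
    return False


def similarity_join(records, t):
    """Return index pairs (a, b), a < b, whose Jaccard similarity is >= t."""
    freq = Counter(tok for r in records for tok in set(r))
    # Global order: rare tokens first, ties broken by the token itself.
    ordered = [sorted(set(r), key=lambda tok: (freq[tok], tok)) for r in records]
    result = []
    for a, b in combinations(range(len(ordered)), 2):
        if passes_filters(ordered[a], ordered[b], t):
            # Verification step: filters only prune, they never confirm.
            if jaccard(ordered[a], ordered[b]) >= t:
                result.append((a, b))
    return result
```

For instance, `similarity_join([["a","b","c","d","e"], ["a","b","c","d","f"], ["x","y"], ["a","x"]], 0.6)` keeps only the first two records as a matching pair. In a distributed setting the prefix tokens would typically serve as grouping keys rather than being compared pairwise as done here; the sketch keeps the pairwise loop only for brevity.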
Keywords :
Big Data; data integration; information filtering; resource allocation; Big Data integration; MapReduce; candidate pairs reduction; filtering principles; intelligent similarity join; length filter; load balance; position filter; prefix filter; real dataset; record linkage; replication transfer volume; set similarity join; skew data; Algorithm design and analysis; Data models; Filtering algorithms; Information filters; Information management; similarity join
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Web Information System and Application Conference (WISA), 2013 10th
Conference_Location :
Yangzhou
Print_ISBN :
978-1-4799-3218-4
Type :
conf
DOI :
10.1109/WISA.2013.79
Filename :
6778670