DocumentCode
1759682
Title
Efficient Similarity Join Based on Earth Mover’s Distance Using MapReduce
Author
Jia Xu ; Bin Lei ; Yu Gu ; Marianne Winslett ; Ge Yu ; Zhenjie Zhang
Author_Institution
Guangxi Univ., Nanning, China
Volume
27
Issue
8
fYear
2015
fDate
Aug. 1 2015
Firstpage
2148
Lastpage
2162
Abstract
Earth Mover’s Distance (EMD) evaluates the similarity between probability distributions and is known as a robust measure that is more consistent with human similarity perception than traditional similarity functions. An EMD-based similarity join retrieves pairs of probability distributions whose EMD is below a specified threshold, supporting many important applications such as duplicate image retrieval and sensor pattern recognition. This paper studies the possibility of using MapReduce to improve the scalability of the EMD similarity join. Existing MapReduce optimization techniques mainly aim to minimize communication overhead, so they are not applicable to our problem, which is dominated by the high computational cost of EMD. Utilizing the dual-program mapping technique, we present a new general data partition framework that facilitates effective workload decomposition using MapReduce, ensuring that distributions that are similar in terms of EMD are mapped to the same reduce task for further verification. New optimization strategies are also proposed to balance the workloads among reduce tasks and eliminate large unnecessary EMD evaluations. Our experiments verify the superiority of our proposal in system efficiency, with an advantage of at least one order of magnitude over the state-of-the-art solution, and in system effectiveness, through a real case study of the abused image phenomenon on the most popular C2C Web site in China.
Keywords
Web sites; data handling; electronic commerce; image processing; linear programming; parallel processing; resource allocation; statistical distributions; C2C Web site; China; EMD-based similarity join; MapReduce; abused image phenomenon; dual-program mapping technique; earth mover’s distance; general data partition framework; human similarity perception; optimization strategies; probability distribution; reduce task; robust measure; scalability improvement; similar distributions; similarity functions; workload balancing; workload decomposition; Earth; Histograms; Measurement; Optimization; Partitioning algorithms; Scalability; Silicon; Earth Mover’s Distance; MapReduce; Probabilistic distributions; parallel computing
fLanguage
English
Journal_Title
Knowledge and Data Engineering, IEEE Transactions on
Publisher
IEEE
ISSN
1041-4347
Type
jour
DOI
10.1109/TKDE.2015.2411281
Filename
7056549