DocumentCode :
1759682
Title :
Efficient Similarity Join Based on Earth Mover’s Distance Using MapReduce
Author :
Jia Xu ; Bin Lei ; Yu Gu ; Marianne Winslett ; Ge Yu ; Zhenjie Zhang
Author_Institution :
Guangxi Univ., Nanning, China
Volume :
27
Issue :
8
fYear :
2015
fDate :
Aug. 1 2015
Firstpage :
2148
Lastpage :
2162
Abstract :
Earth Mover’s Distance (EMD) evaluates the similarity between probability distributions, known as a robust measure more consistent with human similarity perception than traditional similarity functions. EMD-based similarity join retrieves pairs of probability distributions with EMD below a specified threshold, supporting many important applications, such as duplicate image retrieval and sensor pattern recognition. This paper studies the possibility of using MapReduce to improve the scalability of EMD similarity join. While existing MapReduce optimization techniques mainly aim to minimize the communication overhead, such methods are not applicable to our problem, due to the high computational cost of EMD. Utilizing the dual-program mapping technique, we present a new general data partition framework to facilitate effective workload decomposition using MapReduce, ensuring similar distributions in terms of EMD are mapped to the same reduce task for further verification. New optimization strategies are also proposed to balance the workloads among reduce tasks and eliminate large unnecessary EMD evaluations. Our experiments verify the superiority of our proposal on system efficiency, with a huge advantage of at least one order of magnitude than the state-of-the-art solution, and on system effectiveness, with a real case study towards the abused image phenomenon on the most popular C2C Web site in China.
Keywords :
Web sites; data handling; electronic commerce; image processing; linear programming; parallel processing; resource allocation; statistical distributions; C2C Web site; China; EMD-based similarity join; MapReduce; abused image phenomenon; dual-program mapping technique; earth mover’s distance; general data partition framework; human similarity perception; optimization strategies; probability distribution; reduce task; robust measure; scalability improvement; similar distributions; similarity functions; workload balancing; workload decomposition; Earth; Histograms; Measurement; Optimization; Partitioning algorithms; Scalability; Silicon; Earth Mover’s Distance; MapReduce; Probabilistic distributions; parallel computing;
fLanguage :
English
Journal_Title :
Knowledge and Data Engineering, IEEE Transactions on
Publisher :
IEEE
ISSN :
1041-4347
Type :
jour
DOI :
10.1109/TKDE.2015.2411281
Filename :
7056549