DocumentCode :
2298416
Title :
Near duplicate detection using MapReduce
Author :
Qinsheng Du ; Wei Liu ; Guolin Li ; Yonglin Tang
Author_Institution :
Coll. of Comput. Sci. & Technol., Jilin Univ., Changchun, China
fYear :
2012
fDate :
29-31 Dec. 2012
Firstpage :
243
Lastpage :
246
Abstract :
Near-duplicate detection is a widespread problem in massive real-world text datasets. This paper proposes a vector-based algorithm for detecting near duplicates with MapReduce. Given a text set and a similarity threshold, the algorithm returns the pairs whose similarity degree is no less than the threshold. Experimental results on real datasets show that the algorithm is effective.
Keywords :
parallel processing; text analysis; vectors; MapReduce; cluster computing; near duplicate detection; similarity degree; similarity threshold; text dataset; vector based algorithm
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Computer Science and Network Technology (ICCSNT), 2012 2nd International Conference on
Conference_Location :
Changchun
Print_ISBN :
978-1-4673-2963-7
Type :
conf
DOI :
10.1109/ICCSNT.2012.6525930
Filename :
6525930