DocumentCode
2298416
Title
Near duplicate detection using MapReduce
Author
Qinsheng Du ; Wei Liu ; Guolin Li ; Yonglin Tang
Author_Institution
Coll. of Comput. Sci. & Technol., Jilin Univ., Changchun, China
fYear
2012
fDate
29-31 Dec. 2012
Firstpage
243
Lastpage
246
Abstract
Near-duplicate detection is a common problem in massive real-world text datasets. This paper proposes a vector-based algorithm for detecting near duplicates with MapReduce. Given a text set and a similarity threshold, the algorithm returns the pairs whose similarity degree is no less than the threshold. Experimental results on real datasets show that the algorithm is effective.
Keywords
parallel processing; text analysis; vectors; MapReduce; cluster computing; near duplicate detection; similarity degree; similarity threshold; text dataset; vector based algorithm
fLanguage
English
Publisher
IEEE
Conference_Titel
Computer Science and Network Technology (ICCSNT), 2012 2nd International Conference on
Conference_Location
Changchun
Print_ISBN
978-1-4673-2963-7
Type
conf
DOI
10.1109/ICCSNT.2012.6525930
Filename
6525930