Near duplicate detection using MapReduce

Author

Qinsheng Du ; Wei Liu ; Guolin Li ; Yonglin Tang

Author_Institution

Coll. of Comput. Sci. & Technol., Jilin Univ., Changchun, China

fYear

2012

fDate

29-31 Dec. 2012

Firstpage

243

Lastpage

246

Abstract

Near-duplicate detection is a common problem in massive real-world text datasets. This paper proposes a vector-based algorithm for detecting near duplicates with MapReduce. Given a text set and a similarity threshold, the algorithm returns the text pairs whose similarity degree is no less than the threshold. Experimental results on real datasets show that the algorithm is effective.
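The abstract does not give the algorithm's details, so the following is only a minimal sketch of the problem it states: given a text set and a threshold, return the pairs whose similarity is at least the threshold. It assumes cosine similarity over unit-length term-frequency vectors and simulates the map, shuffle, and reduce steps in plain Python. The function names (`to_vector`, `map_phase`, `reduce_phase`, `near_duplicates`) and the choice of similarity measure are illustrative assumptions, not the authors' method.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def to_vector(text):
    """Assumed representation: unit-length term-frequency vector (term -> weight)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}  # empty text has no terms to emit
    return {term: c / norm for term, c in counts.items()}


def map_phase(doc_id, text):
    """Map: emit (term, (doc_id, weight)) for every term in the document."""
    for term, weight in to_vector(text).items():
        yield term, (doc_id, weight)


def reduce_phase(postings):
    """Reduce: for one term, emit a partial dot product per pair of documents sharing it."""
    for (a, wa), (b, wb) in combinations(sorted(postings), 2):
        yield (a, b), wa * wb


def near_duplicates(docs, threshold):
    """Return {(id_a, id_b): cosine} for pairs with cosine >= threshold."""
    # Shuffle: group mapper output by term, as a MapReduce framework would.
    by_term = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in map_phase(doc_id, text):
            by_term[term].append(posting)

    # Second aggregation: sum the partial products of each document pair.
    scores = defaultdict(float)
    for postings in by_term.values():
        for pair, partial in reduce_phase(postings):
            scores[pair] += partial

    return {pair: s for pair, s in scores.items() if s >= threshold}


# Hypothetical example data, not taken from the paper's experiments.
docs = {
    "d1": "the quick brown fox jumps",
    "d2": "the quick brown fox leaps",
    "d3": "mapreduce runs on a cluster",
}
result = near_duplicates(docs, threshold=0.7)
```

In this example the first two texts share four of their five terms, so their cosine similarity is about 0.8 and they are the only pair returned. Because the vectors are normalized up front, summing the partial products per pair yields the cosine directly; a real MapReduce job would run the two aggregation steps as separate distributed phases.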

Keywords

parallel processing; text analysis; vectors; MapReduce; cluster computing; near duplicate detection; similarity degree; similarity threshold; text dataset; vector based algorithm

fLanguage

English

Publisher

IEEE

Conference_Titel

Computer Science and Network Technology (ICCSNT), 2012 2nd International Conference on

Conference_Location

Changchun

Print_ISBN

978-1-4673-2963-7

Type

conf

DOI

10.1109/ICCSNT.2012.6525930

Filename

6525930

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=2298416