Title :
Near duplicate detection using MapReduce
Author :
Qinsheng Du ; Wei Liu ; Guolin Li ; Yonglin Tang
Author_Institution :
Coll. of Comput. Sci. & Technol., Jilin Univ., Changchun, China
Abstract :
Near-duplicate detection is a widespread problem in massive real-world text datasets. This paper proposes a vector-based algorithm for detecting near duplicates with MapReduce. Given a text set and a similarity threshold, the algorithm effectively returns the pairs whose similarity degree is no less than the threshold. Experimental results on real datasets show that the algorithm is effective.
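The problem statement in the abstract (a text set and a threshold in, similar pairs out) can be illustrated with a minimal sketch. The abstract does not say how the vectors are built or which similarity measure is used, so this sketch assumes term-frequency vectors and cosine similarity, and it simulates the map and reduce phases in a single process rather than on a cluster. All function names (`to_vector`, `map_phase`, `reduce_phase`, `near_duplicates`) are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt


def to_vector(text):
    """Term-frequency vector of a text (an assumed representation)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity of two sparse term-frequency vectors (assumed measure)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def map_phase(docs):
    """Map: emit (term, doc_id) so documents sharing a term meet in one group."""
    for doc_id, text in docs.items():
        for term in to_vector(text):
            yield term, doc_id


def reduce_phase(pairs):
    """Reduce: for each term, emit every pair of documents that contain it."""
    groups = defaultdict(set)
    for term, doc_id in pairs:  # grouping by key stands in for the shuffle step
        groups[term].add(doc_id)
    candidates = set()
    for doc_ids in groups.values():
        candidates.update(combinations(sorted(doc_ids), 2))
    return candidates


def near_duplicates(docs, threshold):
    """Return (id_a, id_b, similarity) for pairs with similarity >= threshold."""
    vectors = {doc_id: to_vector(text) for doc_id, text in docs.items()}
    result = []
    for a, b in sorted(reduce_phase(map_phase(docs))):
        sim = cosine(vectors[a], vectors[b])
        if sim >= threshold:
            result.append((a, b, sim))
    return result


if __name__ == "__main__":
    docs = {
        "d1": "the quick brown fox jumps",
        "d2": "the quick brown fox leaps",
        "d3": "mapreduce splits work across a cluster",
    }
    for a, b, sim in near_duplicates(docs, threshold=0.7):
        print(a, b, round(sim, 3))
```

In this toy input, d1 and d2 share four of their five terms, so their cosine similarity is about 0.8 and the pair passes a threshold of 0.7, while d3 shares no term with either and is never compared. Pairing only documents that share a term is one common way to avoid comparing every pair; whether the paper does this is not stated in the abstract.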
Keywords :
parallel processing; text analysis; vectors; MapReduce; cluster computing; near duplicate detection; similarity degree; similarity threshold; text dataset; vector based algorithm
Conference_Title :
Computer Science and Network Technology (ICCSNT), 2012 2nd International Conference on
Conference_Location :
Changchun
Print_ISBN :
978-1-4673-2963-7
DOI :
10.1109/ICCSNT.2012.6525930