• DocumentCode
    2298416
  • Title
    Near duplicate detection using MapReduce
  • Author
    Qinsheng Du; Wei Liu; Guolin Li; Yonglin Tang
  • Author_Institution
    Coll. of Comput. Sci. & Technol., Jilin Univ., Changchun, China
  • fYear
    2012
  • fDate
    29-31 Dec. 2012
  • Firstpage
    243
  • Lastpage
    246
  • Abstract
    Near-duplicate detection is a widespread problem in massive real-world text datasets. This paper proposes a vector-based algorithm for detecting near duplicates with MapReduce. Given a text set and a similarity threshold, the algorithm returns the pairs of texts whose similarity degree is no less than the threshold. Experimental results on real datasets show that the algorithm is effective.
  • Keywords
    parallel processing; text analysis; vectors; MapReduce; cluster computing; near duplicate detection; similarity degree; similarity threshold; text dataset; vector based algorithm
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Science and Network Technology (ICCSNT), 2012 2nd International Conference on
  • Conference_Location
    Changchun
  • Print_ISBN
    978-1-4673-2963-7
  • Type
    conf
  • DOI
    10.1109/ICCSNT.2012.6525930
  • Filename
    6525930