• DocumentCode
    531395
  • Title
    CentralMatch: A Fast and Accurate Method to Identify Blog-Duplicates
  • Author
    Heejin Park ; Lee, Sang-Chul ; Lee, Soon-Haeng ; Kim, Sang-Wook
  • Author_Institution
    Dept. of Electron. & Comput. Eng., Hanyang Univ., Seoul, South Korea
  • Volume
    1
  • fYear
    2010
  • fDate
    Aug. 31 2010-Sept. 3 2010
  • Firstpage
    112
  • Lastpage
    119
  • Abstract
    A group of documents is called near-duplicates if they are almost the same, with only slight differences. Since near-duplicates are a major concern for Web search engines, it is necessary to identify and filter them effectively. Among existing near-duplicate identification methods, MinHashing is the best known. It identifies near-duplicates regardless of where the differing parts are located in two documents. In the blog environment, however, most near-duplicates differ only at their beginning or end. According to our preliminary experiment, about 99% of near-duplicates differ at the beginning or end (blog-duplicates hereafter) and only 1% of them differ in the middle. Thus, blog-duplicates have a long matched sequence in their central parts. Based on this observation, we present a novel algorithm, CentralMatch, to identify blog-duplicates efficiently and accurately. When searching a document database for possible blog-duplicates of a given document, CentralMatch runs 50 times faster than MinHashing. In addition, CentralMatch identifies blog-duplicates more accurately than MinHashing. According to our experiments, when the precisions of MinHashing and CentralMatch are fixed at 0.9, their recalls are around 0.5 and 0.9, respectively, which means CentralMatch finds 80% more blog-duplicates than MinHashing.
  • Keywords
    Internet; document handling; indexing; search engines; string matching; CentralMatch; MinHashing; Web search engines; blog-duplicate identification; document database; near-duplicate identification methods; Blog posts; Duplicate identification
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on
  • Conference_Location
    Toronto, ON
  • Print_ISBN
    978-1-4244-8482-9
  • Electronic_ISBN
    978-0-7695-4191-4
  • Type
    conf
  • DOI
    10.1109/WI-IAT.2010.98
  • Filename
    5616218
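
The observation in the abstract, that blog-duplicates share a long matched sequence in their central parts, can be illustrated with a minimal sketch. This is not the CentralMatch algorithm from the paper, whose details the abstract does not give; it is only an assumed toy check built on Python's standard `difflib`. The function names, the whitespace tokenization, and the `min_fraction` threshold are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def longest_common_run(a, b):
    """Return (start_in_a, start_in_b, length) of the longest contiguous
    run of tokens that appears in both token lists."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return m.a, m.b, m.size


def looks_like_blog_duplicate(doc_a, doc_b, min_fraction=0.6):
    """Toy check (an assumption, not the paper's method): two documents are
    flagged when their longest shared token run is long relative to the
    longer document and spans the midpoint of both documents."""
    a, b = doc_a.split(), doc_b.split()
    if not a or not b:
        return False
    start_a, start_b, size = longest_common_run(a, b)
    long_enough = size >= min_fraction * max(len(a), len(b))
    covers_center_a = start_a <= len(a) // 2 < start_a + size
    covers_center_b = start_b <= len(b) // 2 < start_b + size
    return long_enough and covers_center_a and covers_center_b


body = "the quick brown fox jumps over the lazy dog near the river bank today"
# Differs only at the beginning and end, as most blog near-duplicates do.
repost = "Reposted from my friend: " + body + " Thanks for reading"
# Differs in the middle, which breaks the central matched sequence.
edited = body.replace("lazy", "sleepy")

print(looks_like_blog_duplicate(body, repost))
print(looks_like_blog_duplicate(body, edited))
```

Under these assumptions the repost, which adds text only before and after the shared body, is flagged, while the copy edited in the middle is not. A real system would also need an index to avoid comparing every pair of documents, which this sketch does not attempt.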