CentralMatch: A Fast and Accurate Method to Identify Blog-Duplicates

Author

Park, Heejin ; Lee, Sang-Chul ; Lee, Soon-Haeng ; Kim, Sang-Wook

Author_Institution

Dept. of Electron. & Comput. Eng., Hanyang Univ., Seoul, South Korea

Volume

1

fYear

2010

fDate

Aug. 31 2010-Sept. 3 2010

Firstpage

112

Lastpage

119

Abstract

A group of documents is called near-duplicates if they are almost the same, with only slight differences. Since near-duplicates are a major concern for Web search engines, it is necessary to identify and filter them effectively. Among existing near-duplicate identification methods, MinHashing is the best known. It identifies near-duplicates regardless of where the differing parts are located in the two documents. In the blog environment, however, most near-duplicates differ only at their beginning or end. According to our preliminary experiment, about 99% of near-duplicates differ at the beginning or end (blog-duplicates hereafter) and only 1% differ in the middle. Thus, blog-duplicates share a long matched sequence in their central parts. Based on this observation, we present a novel algorithm, CentralMatch, to identify blog-duplicates efficiently and accurately. When searching a document database for possible blog-duplicates of a given document, CentralMatch runs 50 times faster than MinHashing. In addition, CentralMatch identifies blog-duplicates more accurately than MinHashing. According to our experiments, when the precisions of MinHashing and CentralMatch are both fixed at 0.9, their recalls are around 0.5 and 0.9, respectively, which means CentralMatch finds 80% more blog-duplicates than MinHashing.
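
The observation that blog-duplicates share a long matched sequence in their central parts can be illustrated with a small sketch. This is not the CentralMatch algorithm, whose details the abstract does not give; it only tests, using Python's standard difflib, whether two documents share one contiguous token run that covers most of the longer document. The function names and the 0.8 threshold are hypothetical choices made for this illustration.

```python
from difflib import SequenceMatcher


def longest_shared_run(a_tokens, b_tokens):
    """Return (start_in_a, start_in_b, length) of the longest contiguous
    run of tokens that appears in both sequences."""
    matcher = SequenceMatcher(None, a_tokens, b_tokens, autojunk=False)
    match = matcher.find_longest_match(0, len(a_tokens), 0, len(b_tokens))
    return match.a, match.b, match.size


def has_long_shared_run(a_text, b_text, min_fraction=0.8):
    """Illustrative check (an assumption, not the paper's method): True when
    a single shared run covers at least min_fraction of the longer document,
    so any differences must lie before or after that run."""
    a_tokens, b_tokens = a_text.split(), b_text.split()
    if not a_tokens or not b_tokens:
        return False
    _, _, size = longest_shared_run(a_tokens, b_tokens)
    return size / max(len(a_tokens), len(b_tokens)) >= min_fraction


if __name__ == "__main__":
    body = " ".join(f"w{i}" for i in range(20))
    print(has_long_shared_run("hello there " + body, body + " bye now"))
```

A pair that differs only by an added header or footer passes this check, while a pair whose inserted text splits the body in the middle does not, which is the distinction the abstract draws between blog-duplicates and other near-duplicates.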

Keywords

Internet; document handling; indexing; search engines; string matching; CentralMatch; MinHashing; Web search engines; blog-duplicate identification; document database; near-duplicate identification methods; Blog posts; Duplicate identification

fLanguage

English

Publisher

IEEE

Conference_Titel

Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on

Conference_Location

Toronto, ON

Print_ISBN

978-1-4244-8482-9

Electronic_ISBN

978-0-7695-4191-4

Type

conf

DOI

10.1109/WI-IAT.2010.98

Filename

5616218