Title :
Online system for detection of Chinese near-duplicate documents
Author :
Yang Yang ; Yuquan Chen
Author_Institution :
Dept. of Comput. Sci. & Eng., Shanghai Jiao Tong Univ., Shanghai, China
Abstract :
In information retrieval systems, search engines, and some data-mining systems, one task cannot be avoided: how to rapidly detect duplicate and near-duplicate documents at large scale. Too many duplicates harm such systems in several ways; for example, they reduce computational performance and degrade the user experience. Moreover, when the number of documents grows dynamically, a different approach is needed. This paper aims to construct a practical online detection system built on a simhash-based fingerprint extraction technique. Our contribution is a system that runs online, meaning that the exact number of documents is not known in advance and the system can accept new documents at any time. This requires both efficiency and flexibility, and we propose a solution that meets both requirements.
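As a rough illustration of the general idea (simhash fingerprints compared by Hamming distance, with documents arriving one at a time), the following minimal sketch may help. It is not the authors' implementation: the 64-bit fingerprint width, the character-bigram tokenizer (a stand-in for Chinese word segmentation), the MD5-based token hash, the distance threshold of 3, and the names `tokens`, `simhash`, `hamming`, and `OnlineDetector` are all assumptions made for this example.

```python
import hashlib
from collections import Counter

BITS = 64  # assumed fingerprint width


def tokens(text, n=2):
    """Character n-grams, ignoring whitespace (assumed stand-in for word segmentation)."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(max(len(chars) - n + 1, 1))]


def simhash(text):
    """Compute a BITS-wide simhash fingerprint, weighting each token by its frequency."""
    weights = [0] * BITS
    for tok, count in Counter(tokens(text)).items():
        # Take the first 8 bytes of the MD5 digest as a 64-bit token hash.
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for bit in range(BITS):
            weights[bit] += count if (h >> bit) & 1 else -count
    # A fingerprint bit is set when its accumulated weight is positive.
    return sum(1 << bit for bit in range(BITS) if weights[bit] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


class OnlineDetector:
    """Accepts documents one at a time; the total number is not known in advance."""

    def __init__(self, max_distance=3):
        self.max_distance = max_distance  # assumed near-duplicate threshold
        self.fingerprints = {}            # doc_id -> fingerprint

    def add(self, doc_id, text):
        """Store the document and return ids of earlier near-duplicates."""
        fp = simhash(text)
        matches = [d for d, other in self.fingerprints.items()
                   if hamming(fp, other) <= self.max_distance]
        self.fingerprints[doc_id] = fp
        return matches


detector = OnlineDetector()
detector.add("doc1", "今天上海的天气很好，适合出门散步。")
print(detector.add("doc2", "今天上海的天气很好，适合出门散步。"))
```

This sketch compares each new fingerprint against every stored one, which takes time linear in the number of documents seen so far; a system meant to stay efficient at scale would presumably replace that scan with some form of fingerprint index.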
Keywords :
Internet; data mining; document handling; fingerprint identification; information retrieval; search engines; Chinese near-duplicate document detection; data mining systems; fingerprint extraction technique; information retrieval systems; large-scale duplicate documents; online detection system; searching engines; simhash; Chinese document; Hamming distance; near-duplicate; online detection
Conference_Title :
2012 6th International Conference on New Trends in Information Science and Service Science and Data Mining (ISSDM)
Conference_Location :
Taipei
Print_ISBN :
978-1-4673-0876-2