Title :
Parallelized Near-Duplicate Document Detection Algorithm for Large Scale Chinese Web Pages
Author :
Yongzhuang Wei ; Shuai Wang ; Chunfeng Yuan ; Yihua Huang
Author_Institution :
Dept. of Comput. Sci. & Technol., Nanjing Univ., Nanjing, China
Abstract :
The large number of duplicate and near-duplicate web pages on the Internet creates many problems for search engines. Currently, no single duplicate or near-duplicate web document detection algorithm achieves both good performance and good accuracy. Moreover, most existing algorithms are designed to process English documents and cannot be applied to Chinese documents. This paper presents an integrated algorithm, KMatch, for near-duplicate document detection on large-scale Chinese web pages. First, KMatch employs a Chinese segmentation algorithm to turn Chinese text into meaningful word features, which also compresses the documents. A keyword matching technique is then used to improve detection accuracy. To improve accuracy further, KMatch incorporates the IMatch algorithm to filter out the noise content of a web document and retain the body text. To improve detection performance, we integrate the Shingling algorithm to compress huge datasets into smaller ones. Finally, to further improve detection performance on large-scale Chinese web pages, we design and implement a parallel version of KMatch with MapReduce. The experimental results show that our approach achieves both high precision and high recall, and that the parallelized algorithm with MapReduce achieves good performance and scalability on large-scale datasets.
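As an illustration of the shingling idea mentioned in the abstract, the following is a minimal Python sketch of generic word-level shingling with Jaccard similarity. It is not the KMatch implementation: the function names, the shingle size `k`, the similarity threshold, and the sample documents are all assumptions made for this example, and the input is assumed to be already segmented into words by some Chinese segmenter.

```python
from itertools import combinations


def shingles(tokens, k=3):
    """Return the set of k-word shingles (contiguous word windows)."""
    if len(tokens) < k:
        # Documents shorter than k words yield one shingle (or none if empty).
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, k=3, threshold=0.5):
    """Return pairs of document ids whose shingle sets overlap enough.

    `docs` maps a document id to its list of pre-segmented words.
    The threshold is a hypothetical value chosen for illustration.
    """
    sets = {doc_id: shingles(tokens, k) for doc_id, tokens in docs.items()}
    return [
        (x, y)
        for x, y in combinations(sorted(sets), 2)
        if jaccard(sets[x], sets[y]) >= threshold
    ]


# Hypothetical pre-segmented documents; "a" and "b" differ only in the last word.
docs = {
    "a": ["南京", "大学", "计算机", "科学", "与", "技术", "系"],
    "b": ["南京", "大学", "计算机", "科学", "与", "技术", "学院"],
    "c": ["搜索", "引擎", "网页", "去重", "算法"],
}

print(near_duplicates(docs))
```

This sketch compares every pair of documents, which does not scale to large collections. In a MapReduce setting one would more plausibly group documents by shared shingles or fingerprints before comparing them, but that design is likewise an assumption here and is not taken from the paper.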
Keywords :
Web sites; data compression; distributed processing; document handling; natural language processing; search engines; Chinese documents; Chinese segmentation algorithm; English documents; IMatch algorithms; Internet; KMatch; MapReduce; Shingling algorithm; document compression; large scale Chinese Web pages; parallelized near-duplicate document detection algorithm; search engines; web document detection algorithms; Accuracy; Algorithm design and analysis; Filtering algorithms; Information filters; Search engines; Web pages; Chinese web pages; IMatch; KMatch; Keywords matching; Large scale web documents; MapReduce; Near-duplicate document detection; Shingling;
Conference_Title :
Parallel and Distributed Computing, Applications and Technologies (PDCAT), 2012 13th International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-0-7695-4879-1
DOI :
10.1109/PDCAT.2012.108