DocumentCode
2718535
Title
Effectively and efficiently detect web page duplication
Author
Han, Zhongming ; Mo, Qian ; Liu, Hongzhi ; Sun, Jianzhi
Author_Institution
Sch. of Comput. Sci. & Inf. Eng., Beijing Technol. & Bus. Univ., Beijing, China
fYear
2009
fDate
1-4 Nov. 2009
Firstpage
1
Lastpage
6
Abstract
There are many redundant Web pages on the Internet. In this paper, we present a novel multilayer framework for detecting duplicated Web pages, based on tag statistics and text similarity comparison. We propose two algorithms for detecting similar text paragraphs and implement the framework. The experimental results show that our approach achieves high performance, which indicates that duplicated Web pages can be detected efficiently using only tag statistics and text comparison.
Keywords
Internet; Web sites; text analysis; Web page duplication; similarity text paragraphs detection algorithms; tag statistic; text similarity comparison; Web pages
fLanguage
English
Publisher
IEEE
Conference_Titel
Digital Information Management, 2009. ICDIM 2009. Fourth International Conference on
Conference_Location
Ann Arbor, MI
Print_ISBN
978-1-4244-4253-9
Electronic_ISBN
978-1-4244-4254-6
Type
conf
DOI
10.1109/ICDIM.2009.5356801
Filename
5356801