Effectively and efficiently detect web page duplication

Author

Han, Zhongming ; Mo, Qian ; Liu, Hongzhi ; Sun, Jianzhi

Author_Institution

Sch. of Comput. Sci. & Inf. Eng., Beijing Technol. & Bus. Univ., Beijing, China

fYear

2009

fDate

1-4 Nov. 2009

Abstract

The Internet contains many redundant Web pages. In this paper, we present a novel multilayer framework for detecting duplicated Web pages, based on tag statistics and text similarity comparison. We propose two algorithms for detecting similar text paragraphs and implement the framework. The experimental results show that our approach achieves high performance, which indicates that duplicated Web pages can be detected efficiently using only tag statistics and text comparison.

Keywords

Internet; Web sites; text analysis; Web page duplication; similarity text paragraphs detection algorithms; tag statistic; text similarity comparison; Web pages

fLanguage

English

Publisher

IEEE

Conference_Title

Digital Information Management, 2009. ICDIM 2009. Fourth International Conference on

Conference_Location

Ann Arbor, MI

Print_ISBN

978-1-4244-4253-9

Electronic_ISBN

978-1-4244-4254-6

Type

conf

DOI

10.1109/ICDIM.2009.5356801

Filename

5356801

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=2718535