A priority-based method of near-duplicated text information of web pages deletion

Author

Ling, Yun ; Tao, Xiaobo ; Lv, Hexin

Author_Institution

Coll. of Comput. Sci. & Inf. Eng., Zhejiang Gongshang Univ., Hangzhou, China

fYear

2010

Firstpage

495

Lastpage

499

Abstract

Duplicated web pages returned by search engines not only waste storage resources but also increase the burden on web users. Motivated by the near-duplication found on professional web pages in fields such as employment, a new priority-based method that uses text information to detect and delete near-duplicated web pages is proposed. The method implements an algorithm that extracts the text information of web pages through the DOM tree and a priority-based algorithm that detects near-duplicated text information, thereby reducing web page noise and improving the efficiency of near-duplicate detection. The experimental results indicate that both completely and partly duplicated web pages are detected accurately.
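
The abstract does not give the algorithms themselves, so the following is only a minimal sketch of the general idea, under stated assumptions. Python's standard `html.parser` stands in for a DOM-tree walk, word 3-shingles with Jaccard similarity stand in for the paper's text comparison, and the numeric `priority`, the `threshold` of 0.8, and all function names are hypothetical choices made for illustration, not taken from the paper.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style content (treated as noise)."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Return the page's text content as a single space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def shingles(text, k=3):
    """Set of k-word shingles (assumed similarity feature, not from the paper)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(pages, threshold=0.8):
    """pages: list of (priority, html). Visit pages from highest to lowest
    priority and drop any page whose text is a near-duplicate of a kept one."""
    kept = []
    for priority, html in sorted(pages, key=lambda p: p[0], reverse=True):
        sig = shingles(extract_text(html))
        if all(jaccard(sig, s) < threshold for _, _, s in kept):
            kept.append((priority, html, sig))
    return [(p, h) for p, h, _ in kept]
```

Calling `deduplicate` on a list of `(priority, html)` pairs returns the surviving pairs in descending priority order. Processing pages in that order means that, within a group of near-duplicates, the page assumed to be most authoritative is the one retained.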

Keywords

Internet; text analysis; Web page deletion; near-duplicated text information; priority-based method; Algorithm design and analysis; Containers; Data mining; Employment; HTML; Noise; Web pages; DOM tree; detect and delete near-duplicated web pages; information extraction; search engine;

fLanguage

English

Publisher

IEEE

Conference_Titel

Software Engineering and Service Sciences (ICSESS), 2010 IEEE International Conference on

Conference_Location

Beijing

Print_ISBN

978-1-4244-6054-0

Type

conf

DOI

10.1109/ICSESS.2010.5552319

Filename

5552319