DocumentCode
1635814
Title
A priority-based method of near-duplicated text information of web pages deletion
Author
Ling, Yun ; Tao, Xiaobo ; Lv, Hexin
Author_Institution
Coll. of Comput. Sci. & Inf. Eng., Zhejiang Gongshang Univ., Hangzhou, China
fYear
2010
Firstpage
495
Lastpage
499
Abstract
Duplicated web pages returned by search engines not only waste storage resources but also increase the burden on web users. Motivated by the near-duplication observed on professional web pages, such as those in the field of employment, a new priority-based method for detecting and deleting near-duplicated web pages based on their text information is proposed. The method implements an algorithm that extracts the text information of web pages via the DOM tree and a priority-based algorithm that detects near-duplicated text information, thereby reducing web page noise and improving the efficiency of near-duplicate detection. The experimental results indicate that completely and partly duplicated web pages are detected accurately.
Keywords
Internet; text analysis; Web page deletion; near-duplicated text information; priority-based method; Algorithm design and analysis; Containers; Data mining; Employment; HTML; Noise; Web pages; DOM tree; detect and delete near-duplicated web pages; information extraction; search engine;
fLanguage
English
Publisher
IEEE
Conference_Titel
Software Engineering and Service Sciences (ICSESS), 2010 IEEE International Conference on
Conference_Location
Beijing
Print_ISBN
978-1-4244-6054-0
Type
conf
DOI
10.1109/ICSESS.2010.5552319
Filename
5552319