A Web Page De-duplication Algorithm Based on Data Clearing

Author

Lin, Jian-ming ; Liu, Dong-sheng ; Gao, Shi-wen ; Chen, Wei

Author_Institution

Sch. of Bus. Adm., Zhejiang Gongshang Univ., Hangzhou, China

fYear

2009

fDate

25-26 April 2009

Firstpage

544

Lastpage

547

Abstract

Duplicate web pages returned by search engines not only waste valuable storage but also add to users' browsing burden. Web page de-duplication can effectively improve information retrieval. Following the principle of data cleaning, this paper proposes pretreating web pages to improve the effectiveness and efficiency of feature-code-based web page de-duplication. Its distinguishing feature is that the feature codes are ranked, which reduces the number of comparisons the system performs as well as its space and time complexity. Experiments show that the method is promising for eliminating duplicated web pages at large scale.
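The abstract does not say how the feature code is built, so the following is only a minimal sketch of the general idea under stated assumptions: pages are cleaned first, each cleaned page is reduced to a feature code, and the codes are sorted so that duplicates end up adjacent and can be found in one pass instead of by all-pairs comparison. The names `clean_page`, `feature_code`, and `find_duplicates` are hypothetical, and the SHA-1 hash of the cleaned text is a stand-in for the paper's actual feature code.

```python
from __future__ import annotations

import hashlib
import re


def clean_page(html: str) -> str:
    """Pretreatment (assumed): drop script/style blocks and tags, normalize whitespace and case."""
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def feature_code(text: str) -> str:
    """Stand-in feature code: a SHA-1 digest of the cleaned text (not the paper's scheme)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def find_duplicates(pages: dict[str, str]) -> list[tuple[str, str]]:
    """Return (kept_url, duplicate_url) pairs.

    Sorting the feature codes places equal codes next to each other, so one
    linear pass over the sorted list replaces pairwise comparison of all pages.
    """
    coded = sorted((feature_code(clean_page(html)), url) for url, html in pages.items())
    duplicates = []
    kept_code, kept_url = None, None
    for code, url in coded:
        if code == kept_code:
            duplicates.append((kept_url, url))
        else:
            kept_code, kept_url = code, url
    return duplicates


if __name__ == "__main__":
    sample = {
        "a": "<html><body><p>Hello  World</p></body></html>",
        "b": "<div>hello world</div><script>var x = 1;</script>",
        "c": "<p>Something else</p>",
    }
    print(find_duplicates(sample))
```

An exact hash only catches pages whose cleaned text is identical; a near-duplicate detector would need a feature code that tolerates small differences, which this sketch does not attempt.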

Keywords

Internet; Web sites; information retrieval; search engines; Web page de-duplication; data clearing; Cleaning; Computer science; Data engineering; Data mining; Educational institutions; Feature extraction; Web pages; data cleaning; feature codes; reshipment statement

fLanguage

English

Publisher

IEEE

Conference_Title

Artificial Intelligence, 2009. JCAI '09. International Joint Conference on

Conference_Location

Hainan Island

Print_ISBN

978-0-7695-3615-6

Type

conf

DOI

10.1109/JCAI.2009.181

Filename

5159062

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=2450578