DocumentCode
2450578
Title
A Web Page De-duplication Algorithm Based on Data Clearing
Author
Lin, Jian-ming ; Liu, Dong-sheng ; Gao, Shi-wen ; Chen, Wei
Author_Institution
Sch. of Bus. Adm., Zhejiang Gongshang Univ., Hangzhou, China
fYear
2009
fDate
25-26 April 2009
Firstpage
544
Lastpage
547
Abstract
Duplicate web pages returned by search engines not only waste valuable storage but also burden users' browsing. Web page de-duplication can effectively improve information retrieval. Following the principle of data clearing, this paper proposes preprocessing web pages to improve the effectiveness and efficiency of feature-code-based web page de-duplication. The method ranks the feature codes to reduce the number of comparisons the system performs, as well as its space and time complexity. Experiments show that the method is promising for eliminating duplicate web pages at large scale.
Keywords
Internet; Web sites; information retrieval; search engines; Web page deduplication; data clearing; Cleaning; Computer science; Data engineering; Data mining; Educational institutions; Feature extraction; Web pages; data cleaning; feature codes; reshipment statement; web page de-duplication
fLanguage
English
Publisher
IEEE
Conference_Titel
Artificial Intelligence, 2009. JCAI '09. International Joint Conference on
Conference_Location
Hainan Island
Print_ISBN
978-0-7695-3615-6
Type
conf
DOI
10.1109/JCAI.2009.181
Filename
5159062