DocumentCode :
2286108
Title :
Enhancing URL Normalization Using Metadata of Web Pages
Author :
Soon, Lay-Ki ; Lee, Sang Ho
Author_Institution :
Sch. of Inf. Technol., Soongsil Univ., Seoul
fYear :
2008
fDate :
20-22 Dec. 2008
Firstpage :
331
Lastpage :
335
Abstract :
In this paper, we present our proposed method of incorporating metadata of Web pages to identify equivalent URLs in addition to the standard URL normalization methodology. The metadata considered are the page size and the body text of Web pages. These metadata can be obtained during HTML parsing in the process of crawling without incurring unnecessary cost. Our experiment shows an accuracy of up to 95.38% in identifying equivalent URLs by using the body text of Web pages.
Keywords :
Web sites; hypermedia markup languages; meta data; HTML parsing; URL normalization; Web pages; body text; metadata; page size; Costs; Data mining; HTML; Information technology; Robustness; Service oriented architecture; Uniform resource locators; Web server; World Wide Web; Web Crawling
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Computer and Electrical Engineering, 2008. ICCEE 2008. International Conference on
Conference_Location :
Phuket
Print_ISBN :
978-0-7695-3504-3
Type :
conf
DOI :
10.1109/ICCEE.2008.112
Filename :
4741001