DocumentCode :
2025920
Title :
Identifying Equivalent URLs Using URL Signatures
Author :
Soon, Lay-Ki ; Lee, Sang Ho
Author_Institution :
Sch. of Inf. Technol., Soongsil Univ., Seoul, South Korea
fYear :
2008
fDate :
Nov. 30 2008-Dec. 3 2008
Firstpage :
203
Lastpage :
210
Abstract :
In the standard URL normalization mechanism, URLs are normalized syntactically by a set of predefined steps. In this paper, we propose to enhance the standard URL normalization by incorporating semantically meaningful metadata of the Web pages. The metadata taken into account is the body text of each Web page, which can be extracted during HTML parsing. Given a URL that has undergone the standard normalization mechanism, we construct its URL signature by hashing (fingerprinting) the body text of the associated Web page using Message-Digest algorithm 5 (MD5). URLs that share identical signatures are considered equivalent in our scheme. The experimental results show that the proposed method reduces redundant Web information retrieval by a further 34.57% in comparison with the standard URL normalization mechanism.
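The signature scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `BodyTextExtractor`, `url_signature`, and `group_equivalent` are hypothetical, the input URLs are assumed to have already been through syntactic normalization, and collapsing whitespace and skipping script/style content before hashing are assumptions the abstract does not specify.

```python
import hashlib
from collections import defaultdict
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collects text found inside <body>, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip_depth:
            self.chunks.append(data)


def url_signature(html):
    """Return the MD5 hex digest of a page's body text.

    Whitespace is collapsed first (an assumption, not stated in the abstract).
    """
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def group_equivalent(pages):
    """Group URLs whose pages share a signature.

    `pages` maps an already-normalized URL to its fetched HTML.
    """
    groups = defaultdict(list)
    for url, html in pages.items():
        groups[url_signature(html)].append(url)
    return list(groups.values())


pages = {
    "http://example.com/": "<html><body><p>Hello world</p></body></html>",
    "http://example.com/index.html":
        "<html><head><title>x</title></head><body><p>Hello   world</p></body></html>",
    "http://example.com/other": "<html><body>Bye</body></html>",
}
groups = group_equivalent(pages)
```

In this sketch the first two URLs fall into the same group because their body texts hash to the same MD5 value, even though the URLs differ syntactically; a crawler could then fetch or index only one URL per group.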
Keywords :
Internet; hypermedia markup languages; meta data; HTML parsing; URL signatures; Web information retrieval; Web page meta data; body text fingerprinting; body text hashing; equivalent URL; message-digest algorithm; standard URL normalization mechanism; Crawlers; Fingerprint recognition; HTML; Information retrieval; Information technology; Internet; Signal processing; Uniform resource locators; Web pages; World Wide Web; URL Normalization; Web Crawling;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Signal Image Technology and Internet Based Systems, 2008. SITIS '08. IEEE International Conference on
Conference_Location :
Bali
Print_ISBN :
978-0-7695-3493-0
Type :
conf
DOI :
10.1109/SITIS.2008.21
Filename :
4725805