• DocumentCode
    2286108
  • Title
    Enhancing URL Normalization Using Metadata of Web Pages
  • Author
    Soon, Lay-Ki; Lee, Sang Ho
  • Author_Institution
    Sch. of Inf. Technol., Soongsil Univ., Seoul
  • fYear
    2008
  • fDate
    20-22 Dec. 2008
  • Firstpage
    331
  • Lastpage
    335
  • Abstract
    In this paper, we propose a method that uses the metadata of Web pages, in addition to the standard URL normalization methodology, to identify equivalent URLs. The metadata considered are the page size and the body text of Web pages. These metadata can be obtained during HTML parsing in the process of crawling without incurring unnecessary cost. Our experiment shows an accuracy of up to 95.38% in identifying equivalent URLs by using the body text of Web pages.
  • Keywords
    Web sites; hypermedia markup languages; meta data; HTML parsing; URL normalization; Web pages; body text; metadata; page size; Costs; Data mining; HTML; Information technology; Robustness; Service oriented architecture; Uniform resource locators; Web server; World Wide Web; Web Crawling
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Computer and Electrical Engineering, 2008. ICCEE 2008. International Conference on
  • Conference_Location
    Phuket
  • Print_ISBN
    978-0-7695-3504-3
  • Type
    conf
  • DOI
    10.1109/ICCEE.2008.112
  • Filename
    4741001
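
The abstract describes combining standard URL normalization with two pieces of page metadata (page size and body text) to decide whether two URLs are equivalent. The record does not give the authors' actual algorithm, so the following is only a minimal sketch of the general idea under stated assumptions: the names `Page`, `normalize_url`, `body_fingerprint`, and `likely_equivalent` are hypothetical, the normalization steps are a small subset of common syntactic rules, and comparing a SHA-1 digest of whitespace-collapsed body text is an assumed stand-in for whatever comparison the paper uses.

```python
import hashlib
from dataclasses import dataclass
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped during normalization.
DEFAULT_PORTS = {"http": 80, "https": 443}


@dataclass
class Page:
    """Hypothetical container for what a crawler knows after HTML parsing."""
    url: str
    size: int        # page size in bytes, as reported by the crawler
    body_text: str   # text extracted from the page body


def normalize_url(url: str) -> str:
    """Apply a few common syntactic normalization steps (an assumed subset)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # note: any userinfo is dropped
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"               # empty path becomes "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))  # drop fragment


def body_fingerprint(body_text: str) -> str:
    """Digest of the body text with runs of whitespace collapsed."""
    collapsed = " ".join(body_text.split())
    return hashlib.sha1(collapsed.encode("utf-8")).hexdigest()


def likely_equivalent(a: Page, b: Page) -> bool:
    """Treat two pages as equivalent if their normalized URLs match, or if
    their page sizes match and their body-text fingerprints match."""
    if normalize_url(a.url) == normalize_url(b.url):
        return True
    if a.size != b.size:                   # cheap metadata check first
        return False
    return body_fingerprint(a.body_text) == body_fingerprint(b.body_text)
```

In this sketch the page size acts as an inexpensive filter before the body text is hashed; a real crawler would need to choose how strictly to compare either signal, which this record does not specify.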