  • DocumentCode
    3211405
  • Title
    Applying site information to information retrieval from the Web
  • Author
    Asano, Yasuhito ; Imai, Hiroshi ; Toyoda, Masashi ; Kitsuregawa, Masaru
  • Author_Institution
    Graduate Sch. of Sci., Univ. of Tokyo, Japan
  • fYear
    2002
  • fDate
    12-14 Dec. 2002
  • Firstpage
    83
  • Lastpage
    92
  • Abstract
    In recent years, several information retrieval methods that use information about Web-links, such as HITS and trawling, have been developed. To analyze Web-links for information retrieval by dividing them into links inside each Web site (local-links) and links between Web sites (global-links), a proper model of the Web site is required. Existing research uses a Web server as the model of a Web site. This idea works relatively well when a Web site corresponds to a server, as is the case for public Web sites, but works poorly when multiple Web sites correspond to a server, as is the case for private Web sites on rental Web servers. We propose a new model of the Web site, the "directory-based site", to handle typical private sites, and a method to identify such sites using information about URLs and Web-links. Through computational experiments using jp-domain URL and Web-link data containing over 23 million URLs and 100 million Web-links, collected from July to August 2000 by Toyoda and Kitsuregawa, we verify that the method can approximately identify, at a rate of 66% of over 110,000 servers, whether each server has multiple directory-based sites or not, and we extract over 500,000 directory-based sites and 4 million global-links. We also propose a new framework for Web-link based information retrieval that uses directory-based sites and global-links instead of Web pages and all Web-links, respectively, and we examine the effectiveness of our framework by comparing the result of trawling on our framework with the result on the existing framework.
  • Keywords
    Web sites; information retrieval; HITS; URL; Web links; Web server; directory based site; global links; information retrieval methods; jp-domain URLs; local links; private sites; trawling; Gold; Information analysis; Information science; Search engines; Search methods; Toy industry; Uniform resource locators; Web pages
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web Information Systems Engineering, 2002. WISE 2002. Proceedings of the Third International Conference on
  • Print_ISBN
    0-7695-1766-8
  • Type
    conf
  • DOI
    10.1109/WISE.2002.1181646
  • Filename
    1181646