• DocumentCode
    492503
  • Title
    A Framework of a Hybrid Focused Web Crawler
  • Author
    Sun, Yixue ; Jin, Peiquan ; Yue, Lihua
  • Author_Institution
    Dept. of Comput. Sci. & Technol., Univ. of Sci. & Technol. of China
  • Volume
    2
  • fYear
    2008
  • fDate
    13-15 Dec. 2008
  • Firstpage
    50
  • Lastpage
    53
  • Abstract
    Because of the complex structure of the Web, most approaches to focused crawling employ a local search algorithm, which searches pages only within a sub-graph of the Web. In addition, Web pages often cover multiple topics, which makes it difficult to determine the relevance of a page to a given topic. To address these two issues, we present a new hybrid approach to focused crawling based on meta-search and the VIPS (VIsion-based Page Segmentation) algorithm. We use meta-search to achieve a wider crawling range than a traditional local search algorithm. To obtain better recall and precision, we use a VIPS-based algorithm to compute the relevance of a Web page; it first partitions the page into a set of blocks that reflect the page's semantic structure. The system architecture of the hybrid focused crawler is discussed after a short review of related work, and we then present the framework of the hybrid focused crawling approach.
  • Keywords
    Internet; query formulation; VIPS algorithm; Web pages; Web sub-graph; hybrid focused Web crawler; local search algorithm; meta-search; page semantic structure; vision based page segmentation algorithm; Algorithm design and analysis; Conferences; Crawlers; HTML; Hybrid power systems; Metasearch; Partitioning algorithms; Performance analysis; Uniform resource locators
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Future Generation Communication and Networking Symposia, 2008. FGCNS '08. Second International Conference on
  • Conference_Location
    Sanya
  • Print_ISBN
    978-1-4244-3430-5
  • Electronic_ISBN
    978-0-7695-3546-3
  • Type
    conf
  • DOI
    10.1109/FGCNS.2008.73
  • Filename
    4813520
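
  • Illustration
    The abstract describes computing page relevance over blocks rather than over the whole page. The sketch below is only an illustration of that idea, not the authors' method: the record does not give the paper's relevance formula, so the cosine similarity over term counts, the max-over-blocks aggregation, and the names term_vector, cosine, page_relevance and whole_page_relevance are all assumptions. Blocks are passed in as plain strings; the VIPS segmentation step itself is not implemented here.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercased bag-of-words term counts for a piece of text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity of two term-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def page_relevance(blocks, topic):
    """Block-level score: similarity of the best-matching block.

    Taking the maximum is an assumed aggregation, chosen for illustration.
    """
    topic_vec = term_vector(topic)
    return max((cosine(term_vector(b), topic_vec) for b in blocks), default=0.0)


def whole_page_relevance(blocks, topic):
    """Baseline: similarity of the concatenated page text to the topic."""
    return cosine(term_vector(" ".join(blocks)), term_vector(topic))


if __name__ == "__main__":
    # Hypothetical two-block page: one on-topic block, one unrelated block.
    blocks = [
        "focused web crawler topic relevance",
        "cheap flights hotel deals weather forecast sports news",
    ]
    topic = "focused web crawler"
    print(page_relevance(blocks, topic))
    print(whole_page_relevance(blocks, topic))
```

    On a multi-topic page such as the hypothetical one above, the unrelated block dilutes the whole-page score, while the block-level score still reflects the on-topic block. This is one way to read the abstract's motivation for segmenting a page before computing relevance.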