• DocumentCode
    3353140
  • Title
    The Implementation of a Web Crawler URL Filter Algorithm Based on Caching
  • Author
    Hui-chang, Wang ; Shu-hua, Ruan ; Qi-jie, Tang
  • Author_Institution
    Sch. of Comput. Sci., Sichuan Univ., Chengdu, China
  • Volume
    2
  • fYear
    2009
  • fDate
    28-30 Oct. 2009
  • Firstpage
    453
  • Lastpage
    456
  • Abstract
    In large-scale Web information collection, the URL filter module plays an important role in the Web crawler, which is a central component of a search engine. The performance of the URL filter module directly influences the efficiency of the entire collection system. This paper introduces a URL filter algorithm based on caching and its implementation. The stability and parallel performance of the algorithm are verified by experiments on Websites that contain a large number of Web pages. Experimental results show that the proposed algorithm can achieve satisfactory performance through reasonable adjustment of some of its parameters, and that it is especially suitable for URL filtering on Websites that have a number of page navigator links and index pages.
  • Keywords
    Web sites; cache storage; information filters; search engines; URL filter; Web crawler; Web page; Web site; caching; index page; large-scale Web information collection; page navigator links; search engine; Computer science; Crawlers; Electronic mail; Information filtering; Information filters; Internet; Navigation; Search engines; Uniform resource locators; Web pages; Caching; URL Filter; Web Crawler;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Science and Engineering, 2009. WCSE '09. Second International Workshop on
  • Conference_Location
    Qingdao
  • Print_ISBN
    978-0-7695-3881-5
  • Type
    conf
  • DOI
    10.1109/WCSE.2009.851
  • Filename
    5403354