DocumentCode
3353140
Title
The Implementation of a Web Crawler URL Filter Algorithm Based on Caching
Author
Hui-chang, Wang ; Shu-hua, Ruan ; Qi-jie, Tang
Author_Institution
Sch. of Comput. Sci., Sichuan Univ., Chengdu, China
Volume
2
fYear
2009
fDate
28-30 Oct. 2009
Firstpage
453
Lastpage
456
Abstract
For large-scale Web information collection, the URL filter module plays an important role in a Web crawler, which is a central component of a search engine. The performance of a URL filter module directly influences the efficiency of the entire collection system. This paper introduces a URL filter algorithm based on caching and its implementation. The stability and parallel performance of the algorithm are verified by experiments on Websites that handle a large number of Web pages. Experimental results show that the proposed algorithm can achieve satisfactory performance through reasonable adjustment of some of its parameters, and that it is especially suitable for URL filtering on Websites that have many page navigator links and index pages.
Keywords
Web sites; cache storage; information filters; search engines; URL filter; Web crawler; Web page; Web site; caching; index page; large-scale Web information collection; page navigator links; search engine; Computer science; Crawlers; Electronic mail; Information filtering; Information filters; Internet; Navigation; Search engines; Uniform resource locators; Web pages; Caching; URL Filter; Web Crawler;
fLanguage
English
Publisher
IEEE
Conference_Titel
Computer Science and Engineering, 2009. WCSE '09. Second International Workshop on
Conference_Location
Qingdao
Print_ISBN
978-0-7695-3881-5
Type
conf
DOI
10.1109/WCSE.2009.851
Filename
5403354