DocumentCode :
3353140
Title :
The Implementation of a Web Crawler URL Filter Algorithm Based on Caching
Author :
Hui-chang, Wang ; Shu-hua, Ruan ; Qi-jie, Tang
Author_Institution :
Sch. of Comput. Sci., Sichuan Univ., Chengdu, China
Volume :
2
fYear :
2009
fDate :
28-30 Oct. 2009
Firstpage :
453
Lastpage :
456
Abstract :
For large-scale Web information collection, the URL filter module plays an important role in a Web crawler, which is a central component of a search engine. The performance of the URL filter module directly influences the efficiency of the entire collection system. This paper introduces a URL filter algorithm based on caching and describes its implementation. The stability and parallel performance of the algorithm are verified by experiments on Websites that handle a large number of Web pages. The experimental results show that the proposed algorithm achieves satisfactory performance when some of its parameters are reasonably adjusted, and that it is especially suitable for URL filtering on Websites that have a number of page navigator links and index pages.
Keywords :
Web sites; cache storage; information filters; search engines; URL filter; Web crawler; Web page; Web site; caching; index page; large-scale Web information collection; page navigator links; search engine; Computer science; Crawlers; Electronic mail; Information filtering; Information filters; Internet; Navigation; Search engines; Uniform resource locators; Web pages; Caching; URL Filter; Web Crawler;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer Science and Engineering, 2009. WCSE '09. Second International Workshop on
Conference_Location :
Qingdao
Print_ISBN :
978-0-7695-3881-5
Type :
conf
DOI :
10.1109/WCSE.2009.851
Filename :
5403354