Title :
Bulk-Synchronous On-Line Crawling on Clusters of Computers
Author :
Marin, Mauricio ; Bonacic, Carolina
Author_Institution :
Univ. de Santiago de Chile, Santiago
Abstract :
This paper describes the design of a crawler devised to perform the periodic retrieval of Web documents for a search engine able to accept on-line updates in a concurrent manner. On-line updates come in the form of insertions of new documents or updates to existing ones, all of them mixed with the usual user queries. The search engine is bulk-synchronous, which allows it to deal efficiently with the concurrency control problem. The crawler is also bulk-synchronous so that it can be integrated into the same P-processor cluster executing the search engine. This paper describes and evaluates the practical feasibility of such a crawler.
Keywords :
Internet; concurrency control; query processing; search engines; Web document retrieval; bulk-synchronous on-line crawling; computer clusters; search engine; user queries; Bandwidth; Crawlers; Delay; Indexing; Information retrieval; Processor scheduling; Robots; Uniform resource locators; Yarn; Web crawling; parallel computing
Conference_Title :
Parallel, Distributed and Network-Based Processing, 2008. PDP 2008. 16th Euromicro Conference on
Conference_Location :
Toulouse
Print_ISBN :
978-0-7695-3089-5
DOI :
10.1109/PDP.2008.84