DocumentCode
3057951
Title
Bulk-Synchronous On-Line Crawling on Clusters of Computers
Author
Marin, Mauricio ; Bonacic, Carolina
Author_Institution
Santiago Univ. de Santiago de Chile, Santiago
fYear
2008
fDate
13-15 Feb. 2008
Firstpage
414
Lastpage
421
Abstract
This paper describes the design of a crawler devised to perform the periodic retrieval of Web documents for a search engine able to accept on-line updates in a concurrent manner. On-line updates come in the form of insertions of new documents or updates to existing ones, all of them mixed with the usual user queries. The search engine is bulk-synchronous, which allows it to deal efficiently with the concurrency control problem. The crawler is also bulk-synchronous so that it can be integrated into the same P-processor cluster executing the search engine. This paper describes and evaluates the practical feasibility of such a crawler.
Keywords
Internet; concurrency control; query processing; search engines; Web document retrieval; bulk-synchronous on-line crawling; computer clusters; concurrency control; search engine; user queries; Bandwidth; Crawlers; Delay; Indexing; Information retrieval; Processor scheduling; Robots; Search engines; Uniform resource locators; Yarn; Web crawling; parallel computing
fLanguage
English
Publisher
IEEE
Conference_Titel
Parallel, Distributed and Network-Based Processing, 2008. PDP 2008. 16th Euromicro Conference on
Conference_Location
Toulouse
ISSN
1066-6192
Print_ISBN
978-0-7695-3089-5
Type
conf
DOI
10.1109/PDP.2008.84
Filename
4457152