• DocumentCode
    3057951
  • Title

    Bulk-Synchronous On-Line Crawling on Clusters of Computers

  • Author

    Marin, Mauricio ; Bonacic, Carolina

  • Author_Institution
    Univ. de Santiago de Chile, Santiago
  • fYear
    2008
  • fDate
    13-15 Feb. 2008
  • Firstpage
    414
  • Lastpage
    421
  • Abstract
    This paper describes the design of a crawler devised to perform the periodic retrieval of Web documents for a search engine able to accept on-line updates in a concurrent manner. On-line updates come in the form of insertions of new documents or updates of existing ones, all of them mixed with the usual user queries. The search engine is bulk-synchronous, which allows it to deal efficiently with the concurrency control problem. The crawler is also bulk-synchronous so that it can be integrated into the same P-processor cluster executing the search engine. This paper describes and evaluates the practical feasibility of such a crawler.
  • Keywords
    Internet; concurrency control; query processing; search engines; Web document retrieval; bulk-synchronous on-line crawling; computer clusters; concurrency control; search engine; user queries; Bandwidth; Crawlers; Delay; Indexing; Information retrieval; Processor scheduling; Robots; Search engines; Uniform resource locators; Yarn; Web crawling; parallel computing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel, Distributed and Network-Based Processing, 2008. PDP 2008. 16th Euromicro Conference on
  • Conference_Location
    Toulouse
  • ISSN
    1066-6192
  • Print_ISBN
    978-0-7695-3089-5
  • Type

    conf

  • DOI
    10.1109/PDP.2008.84
  • Filename
    4457152