• DocumentCode
    2596696
  • Title

    Architecture for Parallel Crawling and Algorithm for Change Detection in Web Pages

  • Author

    Yadav, Divakar ; Sharma, A.K. ; Gupta, J.P. ; Garg, N. ; Mahajan, A.

  • Author_Institution
    JIITU, Noida
  • fYear
    2007
  • fDate
    17-20 Dec. 2007
  • Firstpage
    258
  • Lastpage
    264
  • Abstract
    In this paper, we put forward a technique for parallel crawling of the Web. The World Wide Web is growing at a phenomenal rate: as of February 2007, its size stood at around 29 billion pages. One of the most important uses of crawling is to build indexes and keep stored Web pages up to date, so that search engines can later use them to serve end-user queries. The paper puts forward an architecture modeled on a client-server architecture. It discusses a fresh approach to crawling the Web in parallel on multiple machines, and it also addresses the basic issues of crawling. A major part of the Web is dynamic, so changed Web pages must be refreshed constantly. We use a three-step algorithm for page refreshment: it checks whether the structure of a Web page has changed, whether its text content has been altered, and whether an image has changed. For the server, we discuss a unique method for distributing URLs to client machines after determining their priority index. We also discuss a minor variation on prioritizing URLs by forward link count that takes frequency of update into account.
  • Keywords
    Web sites; search engines; URL; Web pages; World Wide Web; change detection; page refreshment; parallel crawling; priority index; search engine; text content; three-step algorithm; Bandwidth; Change detection algorithms; Crawlers; Frequency; Indexing; Service oriented architecture; Uniform resource locators
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Technology, (ICIT 2007). 10th International Conference on
  • Conference_Location
    Orissa
  • Print_ISBN
    0-7695-3068-0
  • Type

    conf

  • DOI
    10.1109/ICIT.2007.64
  • Filename
    4418309