• DocumentCode
    228585
  • Title

    Virtualized dynamic URL assignment web crawling model

  • Author

    Bhaginath, Wani Rohit ; Shingade, Sandip ; Shirole, Mahesh

  • Author_Institution
    Dept. of CE & IT, V.J.T.I., Mumbai, India
  • fYear
    2014
  • fDate
    1-2 Aug. 2014
  • Firstpage
    1
  • Lastpage
    7
  • Abstract
    Web search engines are software systems that retrieve information from the web by accepting a query and returning results as files, pages, images, or other information. These search engines rely heavily on web crawlers that interact with millions of web pages, starting from a seed URL or a list of seed URLs. However, these crawlers demand a large amount of computing resources. The efficiency of a web search engine depends on the performance of its crawling processes. Despite continuous improvement in crawling processes, there is still a need for more efficient and lower-cost crawlers. Most existing crawlers have a centralized coordinator, which introduces a single point of failure. Taking into consideration the shortfalls of existing crawlers, this paper proposes an architecture for a distributed web crawler. The architecture addresses two issues of existing web crawlers. The first is creating a low-cost web crawler using the cloud computing concept of virtualization. The second is balanced load distribution based on dynamic assignment of URLs. The first issue is solved using multi-core machines, where each multi-core processor is divided into a number of virtual machines (VMs) that can perform different crawling tasks in parallel. The second issue is addressed using a clustering algorithm that assigns requests to the machines according to the availability of the clusters, thereby balancing load among components according to their real-time condition. This paper discusses the distributed architecture and details the implementation of the proposed algorithm.
  • Keywords
    Web sites; cloud computing; information retrieval; online front-ends; search engines; Web crawling model; Web pages; Web search engine; balanced load distribution; centralized coordinator; cloud computing; clustering algorithm; distributed Web crawler; distributed architecture; dynamic assignment; low cost Web crawler; multicore processor; multicore machines; software system; virtual machine; virtualized dynamic URL assignment; Computational modeling; Crawlers; HTML; Hardware; Pipeline processing; Software; Uniform resource locators; Clustering algorithm; Crawler; Dynamic assignment; K-means clustering; Seeds; Virtualization
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Advances in Engineering and Technology Research (ICAETR), 2014 International Conference on
  • Conference_Location
    Unnao
  • ISSN
    2347-9337
  • Type

    conf

  • DOI
    10.1109/ICAETR.2014.7012963
  • Filename
    7012963