• DocumentCode
    2160459
  • Title

    Efficient focused crawling based on best first search

  • Author

    Rawat, Seema ; Patil, D.R.

  • Author_Institution
    Dept. of Comput. Eng., RCPIT, Dhule, India
  • fYear
    2013
  • fDate
    22-23 Feb. 2013
  • Firstpage
    908
  • Lastpage
    911
  • Abstract
    The World Wide Web continues to grow at an exponential rate, so retrieving information on a specific topic is gaining importance, which poses exceptional scaling challenges for general-purpose crawlers and search engines. This paper describes a web crawling approach based on best-first search. The goal of a focused crawler is to selectively seek out pages that are relevant to given keywords. Rather than collecting and indexing all available web documents to be able to answer all possible queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant links in the document. This leads to significant savings in hardware as well as network resources and also helps keep the crawl more up to date. To accomplish such goal-directed crawling, we select the top k most relevant documents for a given query and then expand the most promising link, chosen according to link score, to circumvent irrelevant regions of the web.
  • Keywords
    Internet; document handling; query processing; search engines; Web crawling approach; Web documents; World Wide Web; best first search; crawl boundary; exceptional scaling challenge; focused crawling; general-purpose crawlers; goal-directed crawling; keywords; link score; query; search engines; Computers; Conferences; Crawlers; Frequency conversion; Search engines; Uniform resource locators; Web pages; Focused web crawler; Query specific search; Relevancy calculation; TF-IDF;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Advance Computing Conference (IACC), 2013 IEEE 3rd International
  • Conference_Location
    Ghaziabad
  • Print_ISBN
    978-1-4673-4527-9
  • Type

    conf

  • DOI
    10.1109/IAdCC.2013.6514347
  • Filename
    6514347
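
The abstract describes a crawl frontier ordered by link score, with the most promising link expanded first. The following is a minimal sketch of that general best-first idea, not the authors' implementation. Everything in it is an assumption made for illustration: the in-memory `TOY_WEB` graph, the `relevance` function (a plain keyword fraction standing in for whatever relevancy measure the paper uses, such as TF-IDF), and the link score defined as the average of the parent page's relevance and the anchor text's relevance. The top-k document selection mentioned in the abstract is not modelled here; a simple page budget is used instead.

```python
import heapq

# Hypothetical in-memory "web": URL -> (page text, [(anchor text, target URL), ...]).
TOY_WEB = {
    "seed": ("focused crawler overview",
             [("crawler search", "a"), ("cooking recipes", "b")]),
    "a": ("focused crawler best first search",
          [("focused crawler relevance", "c")]),
    "b": ("cooking recipes and travel", [("more recipes", "d")]),
    "c": ("focused crawler relevance score", []),
    "d": ("more cooking recipes", []),
}


def fetch_toy(url):
    """Stand-in for a real HTTP fetch and link extraction (assumption)."""
    return TOY_WEB[url]


def relevance(text, query_terms):
    """Fraction of words in the text that are query terms (toy score)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in query_terms for w in words) / len(words)


def best_first_crawl(seed, query, fetch, max_pages=10):
    """Visit pages in order of decreasing link score, up to max_pages."""
    query_terms = set(query.lower().split())
    # heapq is a min-heap, so scores are stored negated.
    frontier = [(-1.0, seed)]
    visited = set()
    order = []
    while frontier and len(order) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        text, links = fetch(url)
        page_score = relevance(text, query_terms)
        order.append((url, page_score))
        for anchor, target in links:
            if target not in visited:
                # Assumed link score: mean of parent and anchor relevance.
                link_score = (page_score + relevance(anchor, query_terms)) / 2
                heapq.heappush(frontier, (-link_score, target))
    return order


if __name__ == "__main__":
    crawled = best_first_crawl("seed", "focused crawler", fetch_toy, max_pages=3)
    print([url for url, _ in crawled])  # ['seed', 'a', 'c']
```

With a budget of three pages, the sketch reaches `c` through `a` and never fetches the off-topic pages `b` and `d`, which is the kind of resource saving the abstract attributes to focused crawling.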