Efficient focused crawling based on best first search

Author

Rawat, Seema ; Patil, D.R.

Author_Institution

Dept. of Comput. Eng., RCPIT, Dhule, India

fYear

2013

fDate

22-23 Feb. 2013

Firstpage

908

Lastpage

911

Abstract

The World Wide Web continues to grow at an exponential rate, which poses exceptional scaling challenges for general-purpose crawlers and search engines and makes fetching information on a specific topic increasingly important. This paper describes a web crawling approach based on best-first search. The goal of a focused crawler is to selectively seek out pages that are relevant to given keywords. Rather than collecting and indexing all available web documents in order to answer all possible queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant to the crawl and avoids the irrelevant links in a document. This leads to significant savings in hardware and network resources and also helps keep the crawl more up to date. To accomplish such goal-directed crawling, we select the top k most relevant documents for a given query and then expand the most promising link, chosen according to its link score, to circumvent irrelevant regions of the web.
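
The best-first strategy summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the in-memory `web` dictionary, the function names, the cosine similarity over raw term frequencies (a stand-in for the TF-IDF relevancy listed in the keywords), and the 0.5/0.5 weighting in the link score are all assumptions made for illustration.

```python
import heapq
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


def relevance(text, query):
    """Cosine similarity between term-frequency vectors of text and query."""
    t, q = Counter(tokenize(text)), Counter(tokenize(query))
    dot = sum(t[w] * q[w] for w in q)
    norm = math.sqrt(sum(v * v for v in t.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return dot / norm if norm else 0.0


def best_first_crawl(web, seeds, query, k=2, max_pages=10):
    """Crawl a hypothetical in-memory web, always expanding the best link.

    web maps url -> (page_text, [(anchor_text, target_url), ...]).
    Returns the k highest-scoring (score, url) pairs among crawled pages.
    """
    # heapq is a min-heap, so scores are negated to pop the best link first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    visited, scored = set(), []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited or url not in web:
            continue
        visited.add(url)
        text, links = web[url]
        page_score = relevance(text, query)
        scored.append((page_score, url))
        for anchor, target in links:
            if target not in visited:
                # Assumed link score: parent relevance blended with anchor text.
                link_score = 0.5 * page_score + 0.5 * relevance(anchor, query)
                heapq.heappush(frontier, (-link_score, target))
    scored.sort(key=lambda s: (-s[0], s[1]))
    return scored[:k]


# Toy web graph (hypothetical data) with one on-topic and one off-topic branch.
web = {
    "seed": ("web crawler portal",
             [("focused crawler tutorial", "a"), ("cooking recipes", "b")]),
    "a": ("focused crawler best first search crawler",
          [("crawler frontier", "c")]),
    "b": ("pasta recipes and cooking tips", [("more recipes", "d")]),
    "c": ("crawler frontier priority queue", []),
    "d": ("dessert recipes", []),
}
top = best_first_crawl(web, ["seed"], "focused crawler", k=2, max_pages=3)
print(top)
```

With a budget of three pages, the crawl in this toy example follows the on-topic branch (`seed`, `a`, `c`) and never fetches the recipe pages, which is the resource saving the abstract refers to. A priority queue keyed on link score is one common way to realize best-first expansion; the paper's actual scoring function may differ.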

Keywords

Internet; document handling; query processing; search engines; Web crawling approach; Web documents; World Wide Web; best first search; crawl boundary; exceptional scaling challenge; focused crawling; general-purpose crawlers; goal-directed crawling; keywords; link score; query; Computers; Conferences; Crawlers; Frequency conversion; Search engines; Uniform resource locators; Web pages; Focused web crawler; Query specific search; Relevancy calculation; TF-IDF

fLanguage

English

Publisher

IEEE

Conference_Title

Advance Computing Conference (IACC), 2013 IEEE 3rd International

Conference_Location

Ghaziabad

Print_ISBN

978-1-4673-4527-9

Type

conf

DOI

10.1109/IAdCC.2013.6514347

Filename

6514347