Title :
Efficient focused crawling based on best first search
Author :
Rawat, Seema ; Patil, D.R.
Author_Institution :
Dept. of Comput. Eng., RCPIT, Dhule, India
Abstract :
The World Wide Web continues to grow at an exponential rate, which poses exceptional scaling challenges for general-purpose crawlers and search engines and makes fetching information about a specific topic increasingly important. This paper describes a web crawling approach based on best-first search. The goal of a focused crawler is to selectively seek out pages that are relevant to given keywords. Rather than collecting and indexing all available web documents in order to answer all possible queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant links in the document. This leads to significant savings in hardware and network resources and also helps keep the crawl more up to date. To accomplish such goal-directed crawling, we select the top k most relevant documents for a given query and then expand the most promising link, chosen according to its link score, to circumvent irrelevant regions of the web.
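The best-first strategy described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `relevance`, `best_first_crawl`, and `TOY_WEB` are hypothetical, the "web" is an in-memory dictionary rather than real HTTP fetches, the seed list stands in for the top k documents selected for the query, and the scoring is a simple term-overlap ratio used as a stand-in for the TF-IDF relevancy calculation named in the keywords. The link score here is an assumed equal blend of parent-page relevance and anchor-text relevance.

```python
import heapq
from itertools import count


def relevance(text, query_terms):
    """Fraction of words in `text` that are query terms (assumed stand-in for TF-IDF)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(1 for w in words if w in query_terms) / len(words)


def best_first_crawl(pages, seeds, query, max_pages=10):
    """Visit pages in order of decreasing link score.

    pages: dict mapping url -> (page_text, [(anchor_text, target_url), ...])
    seeds: starting URLs (stand-in for the top k documents for the query)
    """
    query_terms = set(query.lower().split())
    tie = count()  # tie-breaker so equal scores pop in insertion order
    frontier = []  # min-heap of (-link_score, tie, url); negation gives max-first
    for url in seeds:
        heapq.heappush(frontier, (-1.0, next(tie), url))
    seen = set(seeds)
    visited = []

    while frontier and len(visited) < max_pages:
        _, _, url = heapq.heappop(frontier)  # most promising link first
        if url not in pages:
            continue  # unreachable page in the toy web
        text, links = pages[url]
        page_score = relevance(text, query_terms)
        visited.append((url, page_score))
        for anchor, target in links:
            if target in seen:
                continue
            seen.add(target)
            # Assumed link score: equal blend of parent relevance and anchor relevance.
            link_score = 0.5 * page_score + 0.5 * relevance(anchor, query_terms)
            heapq.heappush(frontier, (-link_score, next(tie), target))
    return visited


# Hypothetical toy web: two on-topic branches and one off-topic branch.
TOY_WEB = {
    "a": ("focused crawler overview", [("crawler design", "b"), ("cooking recipes", "c")]),
    "b": ("crawler frontier design", [("focused crawler paper", "d")]),
    "c": ("pasta recipes", [("more pasta", "e")]),
    "d": ("focused crawler evaluation", []),
    "e": ("pasta sauce", []),
}

order = best_first_crawl(TOY_WEB, ["a"], "focused crawler", max_pages=3)
print([url for url, _ in order])  # ['a', 'b', 'd']
```

With a budget of three pages, the off-topic branch (`c`, `e`) is never fetched, which is the resource saving the abstract attributes to focused crawling. A priority queue is a natural fit for the frontier because the crawler repeatedly needs the single highest-scoring unvisited link.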
Keywords :
Internet; document handling; query processing; search engines; Web crawling approach; Web documents; World Wide Web; best first search; crawl boundary; exceptional scaling challenge; focused crawling; general-purpose crawlers; goal-directed crawling; keywords; link score; query; search engines; Computers; Conferences; Crawlers; Frequency conversion; Search engines; Uniform resource locators; Web pages; Focused web crawler; Query specific search; Relevancy calculation; TF-IDF;
Conference_Title :
Advance Computing Conference (IACC), 2013 IEEE 3rd International
Conference_Location :
Ghaziabad
Print_ISBN :
978-1-4673-4527-9
DOI :
10.1109/IAdCC.2013.6514347