Title :
Adaptive focused crawling based on link analysis
Author :
Hati, Debashis ; Sahoo, Biswajit ; Kumar, Amritesh
Author_Institution :
Sch. of Comput. Eng., KIIT Univ., Bhubaneswar, India
Abstract :
A web search engine is designed to search for information on the World Wide Web (WWW). Crawlers are programs that traverse the Internet and retrieve web pages by following hyperlinks. Faced with the large number of spam websites, traditional web crawlers do not perform well. Focused crawlers use semantic web technologies to analyze the semantics of hyperlinks and web documents. The focused crawler of a special-purpose search engine aims to selectively seek out pages that are relevant to a pre-defined set of topics, rather than to explore all regions of the Web. A focused crawler is a program that searches the Internet for information related to topics of interest; its main property is that it does not need to collect all web pages, but selects and retrieves only the relevant ones. Because the crawler is only a computer program, it cannot by itself determine how relevant a web page is. The major problem is how to retrieve the maximal set of relevant, high-quality pages. In our proposed approach, we calculate the score of an unvisited URL from four components: the relevancy of its anchor text, the similarity between its description in the Google search engine and the topic keywords, the similarity between its cohesive text and the topic keywords, and the relevancy score of its parent pages. The relevancy score is calculated using the vector space model.
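The scoring idea in the abstract can be illustrated with a minimal sketch. The abstract names the four components and the vector space model but gives no formulas, so everything specific here is an assumption: raw term-frequency vectors compared by cosine similarity, equal weights for the four components, and the mean of the parent-page scores. The function names (`term_vector`, `cosine_similarity`, `url_score`) are hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Build a raw term-frequency vector (assumed weighting, not TF-IDF)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def url_score(topic_keywords, anchor_text, description, cohesive_text,
              parent_scores, weights=(0.25, 0.25, 0.25, 0.25)):
    """Combine the four components named in the abstract.

    The weighted sum, the equal weights, and the averaging of parent
    scores are assumptions made for illustration only.
    """
    topic = term_vector(topic_keywords)
    parent = sum(parent_scores) / len(parent_scores) if parent_scores else 0.0
    components = (
        cosine_similarity(term_vector(anchor_text), topic),
        cosine_similarity(term_vector(description), topic),
        cosine_similarity(term_vector(cohesive_text), topic),
        parent,
    )
    return sum(w * c for w, c in zip(weights, components))
```

Under these assumptions, a crawler would compute `url_score` for each unvisited URL and fetch the highest-scoring URLs first. The description text would come from a search-engine result snippet, which the sketch takes as a plain string rather than fetching it.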
Keywords :
Web sites; online front-ends; search engines; semantic Web; Anchor text relevancy; Google search engine; Internet; URL score; Web crawler; Web document; Web page; Web search engine; World Wide Web; adaptive focused crawling; cohesive text similarity; focused crawler; hyperlinks; link analysis; relevancy score; semantic Web technology; similarity score; spam Web sites; special-purpose search engine; topic keywords; vector space model; Computer science education; Crawlers; Design engineering; Uniform resource locators; Web pages; Web server; crawler
Conference_Title :
Education Technology and Computer (ICETC), 2010 2nd International Conference on
Conference_Location :
Shanghai
Print_ISBN :
978-1-4244-6367-1
DOI :
10.1109/ICETC.2010.5529641