DocumentCode :
1946013
Title :
UDBFC: An effective focused crawling approach based on URL Distance calculation
Author :
Hati, Debashis ; Kumar, Amritesh
Author_Institution :
Sch. of Comput. Eng., KIIT Univ., Bhubaneswar, India
Volume :
2
fYear :
2010
fDate :
9-11 July 2010
Firstpage :
59
Lastpage :
63
Abstract :
Vertical search engines use focused crawlers as their key component and develop specific algorithms to select web pages relevant to a pre-defined set of topics. Building an effective semantic pattern for specific topics is therefore extremely important to such search engines. Crawlers are software that traverses the internet and retrieves web pages by following hyperlinks. Here we propose UDBFC (URL Distance Based Focused Crawler), an algorithm based on a double-crawler framework (an experimental crawler and a focused crawler). The main aim of UDBFC is to measure the relevancy between a seed page and its child pages using the vector space model. Seed pages are the search results common to the three most popular search engines: Google, Yahoo and MSN Search. Child page links are the out-links of a seed page, extracted from it by a link extractor tool. The seed page and its child pages are fetched by the experimental crawler. It calculates the relevancy between the seed page and each of its child pages. After the relevancy calculation, it defines groups based on the relevancy scores. It then uses the focused crawler to fetch topic-specific pages from the internet based on a distance score calculated between the grouped URLs and each URL that is to be fetched.
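The relevancy step described in the abstract can be illustrated with a minimal sketch. It assumes raw term-frequency vectors and cosine similarity as the vector space model measure. The abstract does not specify the weighting scheme, the grouping rule, or the URL distance formula, so the function names `term_vector`, `cosine_relevancy` and `group_by_relevancy`, the two-group split and the `threshold` value are all hypothetical choices made for illustration, not the authors' method.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercase the text, split on non-alphanumerics, and count term frequencies."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_relevancy(seed_text, child_text):
    """Cosine similarity between the term-frequency vectors of two pages."""
    a, b = term_vector(seed_text), term_vector(child_text)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def group_by_relevancy(seed_text, children, threshold=0.5):
    """Split child URLs into two groups by score (threshold is an assumed value)."""
    groups = {"relevant": [], "irrelevant": []}
    for url, text in children.items():
        score = cosine_relevancy(seed_text, text)
        key = "relevant" if score >= threshold else "irrelevant"
        groups[key].append((url, score))
    return groups
```

Here `children` is assumed to map each child URL to the text already fetched for that page; fetching and link extraction are left out of the sketch.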
Keywords :
Web sites; relevance feedback; search engines; semantic Web; Internet; MSN search; UDBFC; URL distance based focused crawler algorithm; Web page retrieval; Yahoo search; child page; Google search; hyperlink; link extractor tool; search engine; seed page; semantic pattern; uniform resource locator; vector space model; Australia; Crawlers; Pediatrics; World Wide Web; distance calculation; focused crawler; vertical search engine;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer Science and Information Technology (ICCSIT), 2010 3rd IEEE International Conference on
Conference_Location :
Chengdu
Print_ISBN :
978-1-4244-5537-9
Type :
conf
DOI :
10.1109/ICCSIT.2010.5564423
Filename :
5564423