DocumentCode
1946013
Title
UDBFC: An effective focused crawling approach based on URL Distance calculation
Author
Hati, Debashis ; Kumar, Amritesh
Author_Institution
Sch. of Comput. Eng., KIIT Univ., Bhubaneswar, India
Volume
2
fYear
2010
fDate
9-11 July 2010
Firstpage
59
Lastpage
63
Abstract
Vertical search engines use focused crawlers as their key component and develop specific algorithms to select web pages relevant to a pre-defined set of topics. Building an effective semantic pattern for specific topics is therefore extremely important to such search engines. Crawlers are software programs that traverse the Internet and retrieve web pages by following hyperlinks. Here we propose UDBFC (URL Distance Based Focused Crawler), an algorithm based on a double-crawler framework (an experimental crawler and a focused crawler). The main aim of UDBFC is to measure the relevancy between a seed page and a child page using the vector space model. Seed pages are the search results common to three of the most popular search engines: Google, Yahoo and MSN Search. Child page links are the out-links of a seed page, extracted from it by a link extractor tool. The seed page and its child pages are fetched by the experimental crawler. The algorithm then calculates the relevancy between the seed page and each of its child pages. After the relevancy calculation, it defines groups based on the relevancy scores. It then uses the focused crawler to fetch topic-specific pages from the Internet based on a distance score calculated between the grouped URLs and each URL that is to be fetched.
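The relevancy step described in the abstract can be illustrated with a minimal sketch. It assumes plain term-frequency vectors and cosine similarity as the vector space model measure, and a single hypothetical threshold for grouping; the abstract does not specify the weighting scheme, the grouping rule, or the URL distance formula, so the function names (`term_vector`, `cosine_relevancy`, `group_by_relevancy`) and the `threshold` parameter are illustrative assumptions, not the authors' method.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Build a term-frequency vector (assumed weighting, not from the paper)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_relevancy(seed_text, child_text):
    """Cosine similarity between the seed page and a child page."""
    a, b = term_vector(seed_text), term_vector(child_text)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty page has no measurable relevancy
    return dot / (norm_a * norm_b)


def group_by_relevancy(seed_text, children, threshold=0.5):
    """Split child URLs into two groups by score (hypothetical grouping rule).

    `children` maps each child URL to its page text.
    """
    groups = {"relevant": [], "irrelevant": []}
    for url, text in children.items():
        score = cosine_relevancy(seed_text, text)
        key = "relevant" if score >= threshold else "irrelevant"
        groups[key].append((url, score))
    return groups
```

In this sketch, a child page that shares most of its vocabulary with the seed page scores close to 1, and a page with no shared terms scores 0. A real crawler would likely use TF-IDF weights and stop-word removal, which are omitted here for brevity.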
Keywords
Web sites; relevance feedback; search engines; semantic Web; Internet; MSN search; UDBFC; URL distance based focused crawler algorithm; Web page retrieval; Yahoo search; child page; Google search; hyperlink; link extractor tool; search engine; seed page; semantic pattern; uniform resource locator; vector space model; Australia; Crawlers; Pediatrics; World Wide Web; distance calculation; focused crawler; vertical search engine
fLanguage
English
Publisher
IEEE
Conference_Titel
Computer Science and Information Technology (ICCSIT), 2010 3rd IEEE International Conference on
Conference_Location
Chengdu
Print_ISBN
978-1-4244-5537-9
Type
conf
DOI
10.1109/ICCSIT.2010.5564423
Filename
5564423