Title :
Rough Set Based Ensemble Prediction for Topic Specific Web Crawling
Author :
Saha, Suman ; Murthy, C.A. ; Pal, Sankar K.
Author_Institution :
Center for Soft Comput. Res., Indian Stat. Inst., Kolkata
Abstract :
The rapid growth of the World Wide Web has made the problem of useful resource discovery an important one in recent years. Several techniques, such as focused crawling and intelligent crawling, have recently been proposed for topic-specific resource discovery. All of these crawlers exploit the behavior of hypertext features to perform topic-specific resource discovery. A focused crawler uses the relevance score of a crawled page to score the unvisited URLs extracted from it. The scored URLs are added to the frontier, and the crawler then picks the best-scoring URL to crawl next. Focused crawlers rely on different types of features of the crawled pages to keep the crawling scope within the desired domain; these features are obtained from the URL, anchor text, link structure, and text contents of the parent and ancestor pages. Different focused crawling algorithms use different sets of these features to predict the relevance and quality of unvisited Web pages. In this article, a combined method based on rough set theory is proposed. It combines the available predictions using decision rules and can build much larger domain-specific collections with less noise. Our experiment yielded a better harvest rate and better target recall for focused crawling.
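The generic focused-crawling loop described in the abstract (score unvisited URLs by the relevance of the page they were found on, add them to the frontier, crawl the best one next) can be sketched as follows. This is a minimal illustration only, not the paper's rough set based ensemble method: the names focused_crawl, TOY_WEB, toy_fetch_links and toy_relevance are hypothetical, the "Web" is an in-memory dictionary, and relevance is a toy prefix check standing in for a real classifier.

```python
import heapq

def focused_crawl(seeds, fetch_links, relevance, max_pages):
    """Best-first crawl: each unvisited URL inherits its parent's relevance."""
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        parent_score = relevance(url)  # relevance of the crawled page
        for child in fetch_links(url):
            if child not in seen:
                seen.add(child)
                heapq.heappush(frontier, (-parent_score, child))
    return visited

# Hypothetical toy link graph (assumption: stands in for real page fetches).
TOY_WEB = {
    "sports/seed": ["sports/a", "news/b"],
    "sports/a": ["sports/c", "news/d"],
    "news/b": ["news/e"],
    "sports/c": [],
    "news/d": [],
    "news/e": [],
}

def toy_fetch_links(url):
    """Return the out-links of a page in the toy graph."""
    return TOY_WEB.get(url, [])

def toy_relevance(url):
    """Toy relevance: pages under 'sports' count as on-topic."""
    return 1.0 if url.startswith("sports") else 0.1

order = focused_crawl(["sports/seed"], toy_fetch_links, toy_relevance, max_pages=5)
print(order)
```

In this toy run the only link found on the off-topic page news/b receives a low score and is left in the frontier, while links found on on-topic pages are crawled first. Using the parent's score alone is a coarse predictor, which is the kind of single-feature prediction the abstract proposes to combine with others.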
Keywords :
Internet; Web sites; hypermedia; rough set theory; URL extraction; Web pages; World Wide Web; hypertext features behavior; resource discovery; rough set based ensemble prediction; topic specific Web crawling; Crawlers; Information retrieval; Pattern recognition; Search engines; Set theory; Software libraries; Taxonomy; Uniform resource locators; Classification; Web resource discovery
Conference_Title :
Advances in Pattern Recognition, 2009. ICAPR '09. Seventh International Conference on
Conference_Location :
Kolkata
Print_ISBN :
978-1-4244-3335-3
DOI :
10.1109/ICAPR.2009.17