Title :
Design of a Metacrawler for web document retrieval
Author :
Babu, K. R. Remesh ; Arya, A.P.
Author_Institution :
Dept. of Inf. Technol., Gov. Eng. Coll., Idukki, India
Abstract :
Web crawlers "browse" the World Wide Web (WWW) on behalf of a search engine, collecting web pages from collections of billions of documents. A Metacrawler is similar to a meta search engine, which combines the top web search results from popular search engines. The World Wide Web is growing rapidly, which poses great challenges to general-purpose crawlers. This paper introduces an architectural framework for a Metacrawler. The crawler enables the user to retrieve information relevant to a topic from more than one traditional web search engine, and it fetches only the pages that are relevant to that topic. The PageRank algorithm is often used to rank web pages, but this ranking causes the problem of topic drift. A modified PageRank algorithm is therefore used to rank the retrieved web pages in a way that reduces this problem. A clustering method is used to combine the search results so that the user can easily select web pages from the clustered results according to their requirements. Experimental results show the effectiveness of the Metacrawler.
Keywords :
Internet; Web sites; information retrieval; pattern clustering; search engines; Web crawlers; Web document retrieval; Web page retrieval; Web search engines; World Wide Web; architectural framework; clustering method; general purpose crawlers; information retrieval; metacrawler design; metasearch engines; modified PageRank algorithm; topic-drift problem; Algorithm design and analysis; Clustering algorithms; Crawlers; Engines; Metasearch; Search engines; Web pages; Clustering; Metacrawler; Ranking Algorithms; Search Engine; Web Crawler;
Conference_Title :
Intelligent Systems Design and Applications (ISDA), 2012 12th International Conference on
Conference_Location :
Kochi
Print_ISBN :
978-1-4673-5117-1
DOI :
10.1109/ISDA.2012.6416585