A WEBIR Crawling Framework for Retrieving Highly Relevant Web Documents: Evaluation Based on Rank Aggregation and Result Merging Algorithms

Author

Shekhar, Shashi ; Arya, K.V. ; Agarwal, Rohit ; Kumar, Rakesh

Author_Institution

GLA Univ., Mathura, India

fYear

2011

fDate

7-9 Oct. 2011

Firstpage

83

Lastpage

88

Abstract

Finding relevant information on the web is an ongoing problem. Commercial search engines like Google rely on sophisticated algorithms to index huge collections of web pages and make them accessible to user queries. Users, however, are still frequently overloaded with irrelevant results. The required information is often replicated and scattered across various disjoint databases. For effective web information retrieval, users need to consult several commercial search engines built on different architectures and principles. Rank aggregation and result merging are key components of the crawling mechanisms used by commercial search engines. Once the results from various search engines are collected, they need to be merged into a single unified ranked list. The effectiveness of any crawling mechanism is closely related to the rank aggregation and result merging algorithm it employs. In this paper, we investigate a variety of rank aggregation and result merging algorithms based on a wide range of available information. The effectiveness of these algorithms is then compared experimentally with that of our proposed crawling framework, using queries from the TREC Web track and the three most popular general-purpose search engines. Our experiments yield two important results. First, simple result merging strategies can outperform Google, Yahoo and MSN Live. Second, the proposed Content Based Result Aggregation (CBRA) algorithm outperforms other existing content-based merging algorithms that rely on full document content.
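
As an illustration of what merging several engines' results into a single unified ranked list can look like, the following is a minimal sketch of a Borda count, one well-known simple rank aggregation strategy. It is not the paper's CBRA algorithm, which this record does not describe in detail. The function name `borda_merge`, the document IDs, and the three engine lists are hypothetical, and each input list is assumed to contain no duplicate IDs.

```python
from collections import defaultdict


def borda_merge(ranked_lists):
    """Merge ranked lists of document IDs into one list using a Borda count.

    Hypothetical helper for illustration only; not the CBRA algorithm.
    Assumes each input list holds unique document IDs, best result first.
    """
    # The candidate pool is every document returned by at least one engine.
    pool = set()
    for ranking in ranked_lists:
        pool.update(ranking)
    pool_size = len(pool)

    # A document at 0-based position p in a list earns (pool_size - p) points.
    # Documents an engine did not return earn nothing from that engine.
    scores = defaultdict(int)
    for ranking in ranked_lists:
        for position, doc in enumerate(ranking):
            scores[doc] += pool_size - position

    # Highest total score first; ties are broken by document ID for determinism.
    return sorted(pool, key=lambda doc: (-scores[doc], doc))


# Made-up result lists standing in for three search engines.
engine_a = ["d1", "d2", "d3"]
engine_b = ["d2", "d1", "d4"]
engine_c = ["d2", "d3", "d1"]

merged = borda_merge([engine_a, engine_b, engine_c])
print(merged)  # ['d2', 'd1', 'd3', 'd4']
```

A positional scheme like this uses only the rank each engine assigns. Content-based merging, as named in the abstract, would instead score documents from their text.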

Keywords

Internet; document handling; query processing; search engines; Google; MSN Live; TREC Web track; WEBIR crawling framework; Web document retrieval; Web information retrieval; Web pages; Yahoo; content based merging algorithms; content based result aggregation algorithm; crawling mechanism; general-purpose search engines; rank aggregation algorithm; result merging algorithm; Computer architecture; Corporate acquisitions; Crawlers; Engines; Merging; Search engines; Web pages; Rank Aggregation; Search result ranking; Web IR; Web crawler; Web page classification;

fLanguage

English

Publisher

IEEE

Conference_Titel

Computational Intelligence and Communication Networks (CICN), 2011 International Conference on

Conference_Location

Gwalior

Print_ISBN

978-1-4577-2033-8

Type

conf

DOI

10.1109/CICN.2011.17

Filename

6112832