• DocumentCode
    2769129
  • Title

    A WEBIR Crawling Framework for Retrieving Highly Relevant Web Documents: Evaluation Based on Rank Aggregation and Result Merging Algorithms

  • Author

    Shekhar, Shashi ; Arya, K.V. ; Agarwal, Rohit ; Kumar, Rakesh

  • Author_Institution
    GLA Univ., Mathura, India
  • fYear
    2011
  • fDate
    7-9 Oct. 2011
  • Firstpage
    83
  • Lastpage
    88
  • Abstract
    Finding relevant information on the web is an ongoing problem. Commercial search engines like Google rely on sophisticated algorithms to index huge collections of web pages and make them accessible to user queries. Users, however, are still frequently overloaded with irrelevant results. The required information is often replicated and scattered across various disjoint databases. For effective web information retrieval, users need to consult several commercial search engines built on different architectures and principles. Rank aggregation and result merging are key components of the crawling mechanism used by commercial search engines. Once the results from various search engines are collected, they need to be merged into a single unified ranked list. The effectiveness of any crawling mechanism is closely related to the rank aggregation and result merging algorithm it employs. In this paper, we investigate a variety of rank aggregation and result merging algorithms based on a wide range of available information. The effectiveness of these algorithms is then compared experimentally with our proposed crawling framework, using queries from the TREC Web track and the three most popular general-purpose search engines. Our experiments yield two important results. First, simple result merging strategies can outperform Google, Yahoo and MSN Live. Second, the proposed Content Based Result Aggregation (CBRA) algorithm outperforms other existing content-based merging algorithms based on full document content.
  • Keywords
    Internet; document handling; query processing; search engines; Google; MSN Live; TREC Web track; WEBIR crawling framework; Web document retrieval; Web information retrieval; Web pages; Yahoo; content based merging algorithms; content based result aggregation algorithm; crawling mechanism; general-purpose search engines; rank aggregation algorithm; result merging algorithm; Computer architecture; Corporate acquisitions; Crawlers; Engines; Merging; Search engines; Web pages; Rank Aggregation; Search result ranking; Web IR; Web crawler; Web page classification;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computational Intelligence and Communication Networks (CICN), 2011 International Conference on
  • Conference_Location
    Gwalior
  • Print_ISBN
    978-1-4577-2033-8
  • Type

    conf

  • DOI
    10.1109/CICN.2011.17
  • Filename
    6112832