• DocumentCode
    3537521
  • Title

    Greedy and Randomized Feature Selection for Web Search Ranking

  • Author

    Pan, Feng ; Converse, Tim ; Ahn, David ; Salvetti, Franco ; Donato, Gianluca

  • Author_Institution
    Bing SF, Microsoft Corp., San Francisco, CA, USA
  • fYear
    2011
  • fDate
    Aug. 31 2011-Sept. 2 2011
  • Firstpage
    436
  • Lastpage
    442
  • Abstract
    Modern search engines have to be fast to satisfy users, so there are hard back-end latency requirements. The set of features useful for search ranking functions, though, continues to grow, making feature computation a latency bottleneck. As a result, not all available features can be used for ranking, and in fact, much of the time only a small percentage of these features can be used. Thus, it is crucial to have a feature selection mechanism that can find a subset of features that both meets latency requirements and achieves high relevance. To this end, we explore different feature selection methods using boosted regression trees, including both greedy approaches (i.e., selecting the features with the highest relative influence as computed by boosted trees, and discounting importance by feature similarity) and randomized approaches (i.e., a best-only genetic algorithm, and a proposed, more efficient randomized method with feature-importance-based backward elimination). We evaluate and compare these approaches using two data sets, one from a commercial Wikipedia search engine and the other from a commercial Web search engine. The experimental results show that the greedy approach that selects top features with the highest relative influence performs close to the full-feature model, and that the randomized feature selection with feature-importance-based backward elimination outperforms all other randomized and greedy approaches, especially on the Wikipedia data.
  • Keywords
    Web sites; greedy algorithms; information retrieval; random processes; regression analysis; search engines; tree data structures; Web search engine; Wikipedia; backend latency requirements; boosted regression trees; data sets; full-feature model; greedy approach; random feature selection; search ranking functions; Data models; Electronic publishing; Encyclopedias; Feature extraction; Genetic algorithms; Internet; Feature Selection; Learning to Rank; Web Search
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer and Information Technology (CIT), 2011 IEEE 11th International Conference on
  • Conference_Location
    Pafos
  • Print_ISBN
    978-1-4577-0383-6
  • Electronic_ISBN
    978-0-7695-4388-8
  • Type

    conf

  • DOI
    10.1109/CIT.2011.16
  • Filename
    6036806