• DocumentCode
    1421746
  • Title

    A Hidden Topic-Based Framework toward Building Applications with Short Web Documents

  • Author

    Phan, Xuan-Hieu ; Nguyen, Cam-Tu ; Le, Dieu-Thu ; Nguyen, Le-Minh ; Horiguchi, Susumu ; Ha, Quang-Thuy

  • Author_Institution
    Grad. Sch. of Inf. Sci., Tohoku Univ., Sendai, Japan
  • Volume
    23
  • Issue
    7
  • fYear
    2011
  • fDate
    7/1/2011
  • Firstpage
    961
  • Lastpage
    976
  • Abstract
    This paper introduces a hidden topic-based framework for processing short and sparse documents (e.g., search result snippets, product descriptions, book/movie summaries, and advertising messages) on the Web. The framework focuses on solving two main challenges posed by these kinds of documents: 1) data sparseness and 2) synonyms/homonyms. The former leads to a lack of shared words and contexts among documents, while the latter are major linguistic obstacles in natural language processing (NLP) and information retrieval (IR). The underlying idea of the framework is that common hidden topics discovered from large external data sets (universal data sets), when included, can make short documents less sparse and more topic-oriented. Furthermore, hidden topics from universal data sets help handle unseen data better. The proposed framework can also be applied to different natural languages and data domains. We carefully evaluated the framework by carrying out two experiments for two important online applications (Web search result classification and matching/ranking for contextual advertising) with large-scale universal data sets, and we achieved significant results.
  • Keywords
    Internet; advertising; document handling; information retrieval; natural language processing; advertising messages; data sparseness; hidden topic based framework; product descriptions; search result snippets; short Web documents; sparse documents; Data mining; Information security; Predictive models; Text processing; Web search; Web mining; classification; contextual advertising; hidden topic analysis; matching; ranking; sparse data
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type

    jour

  • DOI
    10.1109/TKDE.2010.27
  • Filename
    5416713