• DocumentCode
    3166940
  • Title
    Topical N-Grams: Phrase and Topic Discovery, with an Application to Information Retrieval
  • Author
    Wang, Xuerui; McCallum, Andrew; Wei, Xing
  • Author_Institution
    Univ. of Massachusetts, Amherst
  • fYear
    2007
  • fDate
    28-31 Oct. 2007
  • Firstpage
    697
  • Lastpage
    702
  • Abstract
    Most topic models, such as latent Dirichlet allocation, rely on the bag-of-words assumption. However, word order and phrases are often critical to capturing the meaning of text in many text mining tasks. This paper presents topical n-grams, a topic model that discovers topics as well as topical phrases. The probabilistic model generates words in their textual order by, for each word, first sampling a topic, then sampling its status as a unigram or bigram, and then sampling the word from a topic-specific unigram or bigram distribution. Thus our model can model "white house" as a special meaning phrase in the 'politics' topic, but not in the 'real estate' topic. Successive bigrams form longer phrases. We present experiments showing meaningful phrases and more interpretable topics from the NIPS data and improved information retrieval performance on a TREC collection.
  • Keywords
    data mining; information retrieval; probability; sampling methods; text analysis; information retrieval; phrase/topic discovery; probabilistic model; text mining; topic model; topic-specific bigram distribution; topic-specific unigram distribution; topical n-grams; word sampling; Artificial neural networks; Biological neural networks; Context modeling; Data mining; Information retrieval; Natural language processing; Neuroscience; Sampling methods; Text mining; Vocabulary;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on
  • Conference_Location
    Omaha, NE
  • ISSN
    1550-4786
  • Print_ISBN
    978-0-7695-3018-5
  • Type
    conf
  • DOI
    10.1109/ICDM.2007.86
  • Filename
    4470313
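
The generative process summarized in the abstract (sample a topic, then a unigram/bigram status, then the word) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say what the status draw is conditioned on, so conditioning it on the current topic and the previous word is an assumption made here. All names (`theta`, `phi`, `sigma`, `psi`, `generate_document`) and all toy probabilities are hypothetical, chosen only to mirror the "white house" example. Inference (fitting the model to data) is not shown.

```python
import random

VOCAB = ["white", "house", "price", "vote"]
POLITICS, REAL_ESTATE = 0, 1
V = len(VOCAB)
UNIFORM = [1.0 / V] * V


def sample_index(probs, rng):
    """Draw an index from a discrete distribution given as a list of weights."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(n_words, theta, phi, sigma, psi, rng):
    """Generate word ids in textual order.

    theta[z]       : probability of topic z in this document
    phi[z][w]      : topic-specific unigram distribution
    sigma[z][v][w] : topic-specific bigram distribution given previous word v
    psi[z][v]      : probability that the next word continues a bigram
                     (this conditioning is an assumption, see lead-in)
    """
    words, topics, statuses = [], [], []
    prev = None
    for _ in range(n_words):
        z = sample_index(theta, rng)              # 1. sample a topic
        # 2. sample unigram/bigram status; the first word has no predecessor
        is_bigram = prev is not None and rng.random() < psi[z][prev]
        if is_bigram:                             # 3. sample the word
            w = sample_index(sigma[z][prev], rng)
        else:
            w = sample_index(phi[z], rng)
        words.append(w)
        topics.append(z)
        statuses.append(is_bigram)
        prev = w
    return words, topics, statuses


# Toy parameters, invented for illustration only.
theta = [0.5, 0.5]
phi = [
    [0.4, 0.1, 0.0, 0.5],  # politics
    [0.3, 0.3, 0.4, 0.0],  # real estate
]
psi = [
    [1.0, 0.0, 0.0, 0.0],  # politics: after "white", always continue a bigram
    [0.0, 0.0, 0.0, 0.0],  # real estate: never form a bigram
]
sigma = [[UNIFORM[:] for _ in range(V)] for _ in range(2)]
sigma[POLITICS][0] = [0.0, 1.0, 0.0, 0.0]  # politics, after "white": "house"

rng = random.Random(0)
words, topics, statuses = generate_document(20, theta, phi, sigma, psi, rng)
print(" ".join(VOCAB[w] for w in words))
```

With these toy values, "house" is drawn from the bigram distribution only when the current topic is politics and the previous word is "white". In the real estate topic the same word pair can still occur, but only as two independent unigrams.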