DocumentCode
3166940
Title
Topical N-Grams: Phrase and Topic Discovery, with an Application to Information Retrieval
Author
Wang, Xuerui ; McCallum, Andrew ; Wei, Xing
Author_Institution
Univ. of Massachusetts, Amherst
fYear
2007
fDate
28-31 Oct. 2007
Firstpage
697
Lastpage
702
Abstract
Most topic models, such as latent Dirichlet allocation, rely on the bag-of-words assumption. However, word order and phrases are often critical to capturing the meaning of text in many text mining tasks. This paper presents topical n-grams, a topic model that discovers topics as well as topical phrases. The probabilistic model generates words in their textual order by, for each word, first sampling a topic, then sampling its status as a unigram or bigram, and then sampling the word from a topic-specific unigram or bigram distribution. Thus our model can model "white house" as a special meaning phrase in the 'politics' topic, but not in the 'real estate' topic. Successive bigrams form longer phrases. We present experiments showing meaningful phrases and more interpretable topics from the NIPS data and improved information retrieval performance on a TREC collection.
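The generative process summarized in the abstract (sample a topic, then a unigram/bigram status, then the word) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the helper names (`sample_categorical`, `random_distribution`, `generate_document`), the toy vocabulary, the randomly drawn distributions, and the choice to condition the bigram status only on the previous word are all assumptions made for illustration.

```python
import random

def sample_categorical(rng, probs):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    r = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return index
    return len(probs) - 1  # guard against floating-point round-off

def random_distribution(rng, size):
    """Hypothetical helper: a random probability vector (normalized gamma draws)."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(rng, vocab, num_topics, length):
    """Generate words in textual order: topic, then unigram/bigram status, then word."""
    size = len(vocab)
    # Assumed toy parameters; a real model would learn these from data.
    topic_mix = random_distribution(rng, num_topics)
    unigram = [random_distribution(rng, size) for _ in range(num_topics)]
    # bigram[z][prev] is a topic-specific distribution over the next word.
    bigram = [[random_distribution(rng, size) for _ in range(size)]
              for _ in range(num_topics)]
    # Assumed: probability that the word following `prev` continues a phrase.
    continue_prob = [rng.random() for _ in range(size)]

    words, topics, statuses = [], [], []
    prev = None
    for _ in range(length):
        z = sample_categorical(rng, topic_mix)            # 1. sample a topic
        is_bigram = prev is not None and rng.random() < continue_prob[prev]  # 2. status
        if is_bigram:
            w = sample_categorical(rng, bigram[z][prev])  # 3a. topic-specific bigram
        else:
            w = sample_categorical(rng, unigram[z])       # 3b. topic-specific unigram
        words.append(vocab[w])
        topics.append(z)
        statuses.append(is_bigram)
        prev = w
    return words, topics, statuses

if __name__ == "__main__":
    toy_vocab = ["white", "house", "market", "vote"]
    doc_words, doc_topics, doc_statuses = generate_document(
        random.Random(0), toy_vocab, num_topics=2, length=10)
    for word, topic, status in zip(doc_words, doc_topics, doc_statuses):
        print(word, topic, "bigram" if status else "unigram")
```

Because the bigram distribution is indexed by topic, the same word pair can be likely under one topic and unlikely under another, which is the behavior the abstract describes for "white house". A run of consecutive positions flagged as bigram corresponds to a longer phrase.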
Keywords
data mining; information retrieval; probability; sampling methods; text analysis; information retrieval; phrase/topic discovery; probabilistic model; text mining; topic model; topic-specific bigram distribution; topic-specific unigram distribution; topical n-grams; word sampling; Artificial neural networks; Biological neural networks; Context modeling; Data mining; Information retrieval; Natural language processing; Neuroscience; Sampling methods; Text mining; Vocabulary;
fLanguage
English
Publisher
IEEE
Conference_Titel
Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on
Conference_Location
Omaha, NE
ISSN
1550-4786
Print_ISBN
978-0-7695-3018-5
Type
conf
DOI
10.1109/ICDM.2007.86
Filename
4470313