DocumentCode :
3141515
Title :
Dirichlet Process Mixture Models based topic identification for short text streams
Author :
Wang, Chan ; Yuan, Caixia ; Wang, Xiaojie ; Xue, Wenwei
Author_Institution :
Center of Intell. Sci. & Technol., Beijing Univ. of Posts & Telecommun., Beijing, China
fYear :
2011
fDate :
27-29 Nov. 2011
Firstpage :
80
Lastpage :
87
Abstract :
Topic detection and tracking (TDT) has been extensively studied and applied in recent years. However, prior work is mostly based on regular news text, and the problem of scaling to short stories remains largely open. Moreover, prior work performs topic identification on already separated stories, treating story segmentation as a prerequisite, although segmentation is itself a challenging and critical task in TDT research. In this paper, we propose a topic identification method based on the Dirichlet Process Mixture Model (DPMM) that handles topic segmentation, topic detection, and topic tracking in a unified model and achieves reasonable results for short stories. We first present the DPMM and its application to the topic identification task. We then discuss two solutions specifically designed to address the sparseness problem associated with short stories. The first concerns the design of the algorithm flow: instead of a single short text, the processing unit of topic identification is first converted to a session. The second applies an extended DPMM that takes word dependency into account when estimating the word distributions associated with each known topic. We then extend the DPMM to identify topics in spontaneous text streams by handling topic segmentation, topic detection, and topic tracking simultaneously. An attractive advantage of the DPMM is that the number of mixture components need not be fixed in advance, and no prior knowledge about the number or content of topics is required. Compared with other existing methods, it is therefore more suitable for streaming topic identification. Our empirical results on the TDT3 evaluation data verify that the DPMM is valid for topic identification on short text data with stream properties, and that the extended DPMM outperforms the original DPMM.
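The abstract's point that the number of mixture components need not be fixed in advance can be illustrated with the Chinese restaurant process, a standard sequential view of the Dirichlet process prior. The sketch below is an illustration only, not the authors' model: it samples cluster assignments from the prior alone and omits the word likelihood, the session unit, and the word-dependency extension. The function name `crp_assignments` and the parameter `alpha` (the concentration parameter) are hypothetical names chosen for this example.

```python
import random


def crp_assignments(n_items, alpha, seed=0):
    """Sample cluster labels for n_items from a Chinese restaurant process.

    Assumption: this shows only the Dirichlet process prior over partitions.
    A full DPMM would also weight each option by the likelihood of the text
    under that cluster.
    """
    rng = random.Random(seed)
    counts = []  # counts[k] = number of items currently in cluster k
    labels = []
    for _ in range(n_items):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is chosen with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1
        labels.append(k)
    return labels, counts


if __name__ == "__main__":
    labels, counts = crp_assignments(n_items=50, alpha=1.0)
    print(len(counts), "clusters for", len(labels), "items")
```

The number of clusters is never passed in; it grows as items arrive, and larger values of `alpha` tend to open new clusters more often.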
Keywords :
stochastic processes; text analysis; Dirichlet process mixture models; TDT3 evaluation data; algorithm flow design; news text; short text streams; story segmentation; topic detection and tracking; topic identification; topic segmentation; word dependency; DPMM; Dirichlet Process Mixture Model; data streams; extended DPMM; static short text
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Natural Language Processing and Knowledge Engineering (NLP-KE), 2011 7th International Conference on
Conference_Location :
Tokushima
Print_ISBN :
978-1-61284-729-0
Type :
conf
DOI :
10.1109/NLPKE.2011.6138173
Filename :
6138173