PLSA Based Topic Mixture Language Modeling Approach

Author

Bai, Shuanhu ; Li, Haizhou

Author_Institution

Inst. for Infocomm Res., Singapore, Singapore

fYear

2008

fDate

16-19 Dec. 2008

Firstpage

1

Lastpage

4

Abstract

In this paper, we propose a method that extends the use of latent topics to higher-order n-gram models. In training, the parameters of the higher-order n-gram models are estimated from discounted average counts, which are derived by applying probabilistic latent semantic analysis (PLSA) models to the n-gram counts of the training corpus. In decoding, a simple yet efficient topic prediction method predicts the topic of a new document. The proposed topic mixture language model (TMLM) has two advantages over previous methods: 1) it can build topic mixture n-gram LMs (n>1), and 2) it does not require a large general baseline LM. Experimental results show that TMLMs, even with a smaller number of topics, outperform LMs built with both the standard n-gram approach and unsupervised adaptation approaches in terms of perplexity reduction.
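
The abstract does not state the model's formulas, so the following is only a minimal sketch of the general idea of a topic mixture n-gram model, under stated assumptions. It assumes the common mixture form P(w | h, d) = sum over z of P(z | d) P(w | h, z), and it predicts the topic weights of a new document by EM re-estimation with the topic models held fixed. That prediction step is a standard stand-in, not necessarily the topic prediction method proposed in the paper. All names (`mixture_prob`, `predict_topic_weights`, `TOPIC_LMS`, `DOC`) and the toy probabilities are hypothetical.

```python
import math

def mixture_prob(word, history, topic_weights, topic_lms, floor=1e-9):
    """P(word | history, doc) as a weighted sum of topic-conditional bigram probabilities."""
    return sum(
        weight * lm.get((history, word), floor)
        for weight, lm in zip(topic_weights, topic_lms)
    )

def predict_topic_weights(bigrams, topic_lms, iterations=20, floor=1e-9):
    """Estimate P(z | doc) by EM, keeping the topic bigram tables fixed (assumed method)."""
    k = len(topic_lms)
    weights = [1.0 / k] * k
    for _ in range(iterations):
        totals = [0.0] * k
        for history, word in bigrams:
            # E-step: posterior over topics for this bigram.
            joint = [w * lm.get((history, word), floor)
                     for w, lm in zip(weights, topic_lms)]
            norm = sum(joint)
            for z in range(k):
                totals[z] += joint[z] / norm
        # M-step: new weights are the average posteriors.
        weights = [t / len(bigrams) for t in totals]
    return weights

def perplexity(bigrams, topic_weights, topic_lms):
    """Per-bigram perplexity of the document under the mixture model."""
    log_sum = sum(
        math.log(mixture_prob(word, history, topic_weights, topic_lms))
        for history, word in bigrams
    )
    return math.exp(-log_sum / len(bigrams))

# Hypothetical toy data: two topic-conditional bigram tables and one document.
TOPIC_LMS = [
    {("stock", "market"): 0.8, ("stock", "price"): 0.2},  # "finance" topic
    {("stock", "market"): 0.1, ("stock", "soup"): 0.9},   # "cooking" topic
]
DOC = [("stock", "market"), ("stock", "price"), ("stock", "market")]

WEIGHTS = predict_topic_weights(DOC, TOPIC_LMS)
UNIFORM = [0.5, 0.5]
print(WEIGHTS)
print(perplexity(DOC, WEIGHTS, TOPIC_LMS), perplexity(DOC, UNIFORM, TOPIC_LMS))
```

In this toy setup the predicted weights favor the first topic, and the perplexity under the predicted weights is lower than under uniform weights. The sketch leaves out the discounted average counts that the abstract describes for training.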

Keywords

learning (artificial intelligence); natural language processing; higher order n-gram models; probabilistic latent semantic analysis models; topic mixture language modeling; topic prediction method; training corpus; Algorithm design and analysis; Bayesian methods; Clustering algorithms; Decoding; Displays; Error analysis; Prediction methods; Singular value decomposition; Testing; Text categorization

fLanguage

English

Publisher

IEEE

Conference_Title

Chinese Spoken Language Processing, 2008. ISCSLP '08. 6th International Symposium on

Conference_Location

Kunming

Print_ISBN

978-1-4244-2942-4

Electronic_ISBN

978-1-4244-2943-1

Type

conf

DOI

10.1109/CHINSL.2008.ECP.58

Filename

4730312