DocumentCode :
2269335
Title :
Unsupervised language model adaptation using latent Dirichlet allocation and dynamic marginals
Author :
Haidar, Md Akmal ; O'Shaughnessy, Douglas
Author_Institution :
INRS-Energy, Mater. & Telecommun., Univ. of Quebec, Montreal, QC, Canada
fYear :
2011
fDate :
Aug. 29 - Sept. 2, 2011
Firstpage :
1480
Lastpage :
1484
Abstract :
In this paper, we introduce an unsupervised language model adaptation approach using latent Dirichlet allocation (LDA) and dynamic marginals, i.e., locally estimated (smoothed) unigram probabilities from in-domain text data. In the LDA analysis, topic clusters are formed by a hard-clustering method that assigns each document to the single topic contributing the largest number of that document's words. The n-grams of the topics generated by hard clustering are used to compute the mixture weights of the component topic models. Instead of using all the words of the training vocabulary, the LDA analysis uses a subset of words selected with information retrieval techniques. The final adapted model is obtained by minimizing the Kullback-Leibler (KL) divergence between it and the LDA-adapted topic model, subject to the constraint that the marginalized unigram probability distribution of the final adapted model equals the dynamic marginals. We compare our approach with the conventional adapted model obtained by minimizing the KL divergence between the background model and the adapted model under the same constraint. Our approach yields significant perplexity and word error rate (WER) reductions over the traditional approach.
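For context, a minimal sketch of the constrained KL minimization with a unigram marginal (dynamic marginals) constraint is given below, following the standard closed-form of this style of adaptation; the scaling exponent \beta and the normalizer Z(h) are conventional symbols assumed here for illustration, not notation taken from the paper itself. Here P_{\text{LDA}} denotes the LDA-adapted topic model, P_{\text{dyn}} the dynamic-marginal unigram distribution, and P_{\text{LDA}}(w) the unigram marginal of P_{\text{LDA}}.

P_{\text{adapt}}(w \mid h) \;=\; \frac{\alpha(w)\, P_{\text{LDA}}(w \mid h)}{Z(h)},
\qquad
\alpha(w) \;=\; \left( \frac{P_{\text{dyn}}(w)}{P_{\text{LDA}}(w)} \right)^{\beta},
\qquad
Z(h) \;=\; \sum_{w'} \alpha(w')\, P_{\text{LDA}}(w' \mid h).

In this form each n-gram probability is rescaled by a word-dependent factor \alpha(w) that pulls the model's unigram marginal toward the in-domain dynamic marginals, with Z(h) renormalizing per history; \beta is typically a tuning parameter between 0 and 1.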
Keywords :
information retrieval; mixture models; natural language processing; pattern clustering; statistical distributions; text analysis; Kullback-Leibler (KL) divergence minimization; LDA adapted topic model; LDA analysis; WER reduction; background model; component topic models; document handling; dynamic marginals; final adapted model; hard-clustering method; in-domain text data; information retrieval techniques; latent Dirichlet allocation; locally estimated smoothed unigram probabilities; marginalized unigram probability distribution; maximum word number; mixture weights; n-grams; perplexity reduction; topic assignment; topic clusters; unsupervised language model adaptation; word error rate reduction; Adaptation models; Computational modeling; Data models; Equations; Mathematical model; Semantics; Training;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
2011 19th European Signal Processing Conference (EUSIPCO)
Conference_Location :
Barcelona
ISSN :
2076-1465
Type :
conf
Filename :
7074098