DocumentCode :
1419922
Title :
Topic-Dependent-Class-Based n-Gram Language Model
Author :
Naptali, Welly ; Tsuchiya, Masatoshi ; Nakagawa, Seiichi
Author_Institution :
Academic Center for Computing & Media Studies, Kyoto University, Kyoto, Japan
Volume :
20
Issue :
5
fYear :
2012
fDate :
7/1/2012
Firstpage :
1513
Lastpage :
1525
Abstract :
A topic-dependent-class (TDC)-based n-gram language model (LM) is a topic-based LM that employs a semantic extraction method to reveal latent topic information extracted from noun-noun relations. The topic of a given word sequence is decided by voting, on the basis of the most frequently occurring (weighted) noun classes in the context history. Our previous work (W. Naptali, M. Tsuchiya, and S. Nakagawa, "Topic-dependent language model with voting on noun history," ACM Trans. Asian Language Information Processing (TALIP), vol. 9, no. 2, pp. 1-31, 2010) has shown that, in terms of perplexity, TDCs outperform several state-of-the-art baselines, i.e., a word-based or class-based n-gram LM and their interpolation, a cache-based LM, an n-gram-based topic-dependent LM, and a Latent Dirichlet Allocation (LDA)-based topic-dependent LM. This study is a follow-up of our previous work, with three key differences. First, we improve TDCs by employing soft-clustering and/or soft-voting techniques in the training and/or test phases, which resolve data shrinking problems and make TDCs independent of the word-based n-gram. Second, for further improvement, we incorporate a cache-based LM through unigram scaling, because the TDC and cache-based LMs capture different properties of the language. Finally, we provide an evaluation in terms of the word error rate (WER) and an analysis of the automatic speech recognition (ASR) rescoring task. Experiments performed on the Wall Street Journal and the Mainichi Shimbun (a Japanese newspaper) demonstrate that the TDC LM improves both perplexity and the WER. The perplexity reduction is up to 25.1% relative on the English corpus and 25.7% relative on the Japanese corpus. Furthermore, the greatest reduction in the WER, compared to the baseline, is 15.2% relative for the English ASR and 24.3% relative for the Japanese ASR.
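The abstract names two mechanisms: choosing a topic by voting over (weighted) noun classes in the context history, and combining the background LM with a cache-based LM through unigram scaling. The following is a minimal Python sketch of those two ideas under stated assumptions; the function names, class labels, weights, and probabilities are illustrative stand-ins, not the authors' actual model, classes, or data.

# Hypothetical sketch: (1) topic selection by weighted voting over noun
# classes in the context history; (2) unigram scaling of a background
# n-gram probability by a cache unigram. Illustrative only.
from collections import Counter

def vote_topic(history_nouns, noun_to_class, class_weights=None):
    """Return the noun class (topic) that receives the most (weighted)
    votes from the nouns observed in the context history."""
    votes = Counter()
    for noun in history_nouns:
        cls = noun_to_class.get(noun)
        if cls is None:
            continue
        weight = 1.0 if class_weights is None else class_weights.get(cls, 1.0)
        votes[cls] += weight
    return votes.most_common(1)[0][0] if votes else None

def unigram_scaling(p_ngram, p_cache_uni, p_bg_uni, beta=0.5):
    """Scale the background n-gram probability by the ratio of the cache
    (adapted) unigram to the background unigram, raised to beta.
    In practice the result must still be renormalized over the vocabulary."""
    return p_ngram * (p_cache_uni / p_bg_uni) ** beta

# Toy usage with made-up mappings and probabilities.
noun_to_class = {"market": "finance", "stock": "finance", "game": "sports"}
print(vote_topic(["market", "stock", "game"], noun_to_class))  # -> "finance"
print(round(unigram_scaling(0.01, 0.002, 0.001), 4))           # -> 0.0141 (unnormalized)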
Keywords :
natural language processing; speech recognition; English corpus; Japanese corpus; automatic speech recognition rescoring task; context history; data shrinking problem; latent Dirichlet allocation; latent topic information; noun-noun relations; perplexity reduction; semantic extraction method; soft clustering; soft voting; topic-dependent-class-based n-gram language model; unigram scaling; word error rate; word sequence; word-based n-gram; Context; History; Matrix decomposition; Semantics; Speech; Training; Vectors; n-gram; language model; perplexity; speech recognition; topic dependent;
fLanguage :
English
Journal_Title :
IEEE Transactions on Audio, Speech, and Language Processing
Publisher :
IEEE
ISSN :
1558-7916
Type :
jour
DOI :
10.1109/TASL.2012.2183870
Filename :
6129394