DocumentCode :
1401399
Title :
Exploiting latent semantic information in statistical language modeling
Author :
Bellegarda, Jerome R.
Author_Institution :
Apple Comput. Inc., Cupertino, CA, USA
Volume :
88
Issue :
8
fYear :
2000
Firstpage :
1279
Lastpage :
1296
Abstract :
Statistical language models used in large-vocabulary speech recognition must properly encapsulate the various constraints, both local and global, present in the language. While local constraints are readily captured through n-gram modeling, global constraints, such as long-term semantic dependencies, have been more difficult to handle within a data-driven formalism. This paper focuses on the use of latent semantic analysis, a paradigm that automatically uncovers the salient semantic relationships between words and documents in a given corpus. In this approach, (discrete) words and documents are mapped onto a (continuous) semantic vector space, in which familiar clustering techniques can be applied. This leads to the specification of a powerful framework for automatic semantic classification, as well as the derivation of several language model families with various smoothing properties. Because of their large-span nature, these language models are well suited to complement conventional n-grams. An integrative formulation is proposed for harnessing this synergy, in which the latent semantic information is used to adjust the standard n-gram probability. Such hybrid language modeling compares favorably with the corresponding n-gram baseline: experiments conducted on the Wall Street Journal domain show a reduction in average word error rate of over 20%. This paper concludes with a discussion of intrinsic tradeoffs, such as the influence of training data selection on the resulting performance.
Keywords :
computational linguistics; natural languages; probability; speech recognition; Wall Street Journal domain; automatic semantic classification; clustering techniques; global constraints; hybrid language modeling; large-vocabulary speech recognition; latent semantic information; local constraints; n-gram probability; semantic relationships; semantic vector space; statistical language modeling; word error rate; Automatic speech recognition; Error analysis; Fasteners; Natural languages; Parameter estimation; Smoothing methods; Speech analysis; Speech recognition; Training data; Vocabulary;
fLanguage :
English
Journal_Title :
Proceedings of the IEEE
Publisher :
IEEE
ISSN :
0018-9219
Type :
jour
DOI :
10.1109/5.880084
Filename :
880084