Title :
Language modeling by variable length sequences: theoretical formulation and evaluation of multigrams
Author :
Deligne, Sabine ; Bimbot, Frédéric
Author_Institution :
Telecom Paris, France
Abstract :
The multigram model assumes that language can be described as the output of a memoryless source that emits variable-length sequences of words. The estimation of the model parameters can be formulated as a maximum likelihood estimation problem from incomplete data. We show that estimates of the model parameters can be computed through an iterative expectation-maximization algorithm, and we describe a forward-backward procedure for its implementation. We report the results of a systematic evaluation of multigrams for language modeling on the ATIS database. The objective performance measure is the test set perplexity. Our results show that multigrams outperform conventional n-grams for this task.
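As a rough illustration of the model described in the abstract, the likelihood of a word string under a multigram model can be sketched as a sum over all segmentations into sequences of bounded length, computed with a forward recursion. The sketch below covers only this forward pass, not the full expectation-maximization re-estimation. The function name `multigram_likelihood`, the `max_len` parameter, and the toy probability table are assumptions made for illustration and are not taken from the paper.

```python
from typing import Dict, Sequence, Tuple


def multigram_likelihood(
    words: Sequence[str],
    probs: Dict[Tuple[str, ...], float],
    max_len: int,
) -> float:
    """Sum, over every segmentation of `words` into sequences of at most
    `max_len` words, of the product of the sequence probabilities.

    alpha[t] holds the total likelihood of the first t words.
    """
    alpha = [0.0] * (len(words) + 1)
    alpha[0] = 1.0  # the empty prefix has exactly one (empty) segmentation
    for t in range(1, len(words) + 1):
        # The last sequence of the prefix covers words[t - length:t].
        for length in range(1, min(max_len, t) + 1):
            seq = tuple(words[t - length:t])
            # Sequences missing from the table get probability 0 (assumption).
            alpha[t] += alpha[t - length] * probs.get(seq, 0.0)
    return alpha[len(words)]


# Hypothetical toy parameters, invented for this example only.
toy_probs = {("a",): 0.5, ("b",): 0.3, ("a", "b"): 0.2}
likelihood = multigram_likelihood(["a", "b"], toy_probs, max_len=2)
```

For the two-word string above there are two segmentations, [a][b] and [a b], so the result is 0.5 * 0.3 + 0.2. A matching backward pass would be needed to obtain the expected sequence counts used in re-estimation.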
Keywords :
estimation theory; grammars; iterative methods; maximum likelihood estimation; natural languages; speech processing; ATIS database; forward-backward procedure; incomplete data; iterative expectation-maximization algorithm; language modeling; memoryless source; multigram model; objective performance measure; parameter estimation; test set perplexity; variable length sequences; Bismuth; Databases; Dictionaries; Electronic mail; Expectation-maximization algorithms; Probability; Telecommunications; Testing; Vocabulary
Conference_Title :
Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., 1995 International Conference on
Conference_Location :
Detroit, MI
Print_ISBN :
0-7803-2431-5
DOI :
10.1109/ICASSP.1995.479391