Title :
Variable-order N-gram generation by word-class splitting and consecutive word grouping
Author :
Masataki, Hirokazu ; Sagisaka, Yoshinori
Author_Institution :
ATR Interpreting Telephony Res. Labs., Kyoto, Japan
Abstract :
In this paper, a generation scheme for variable-order N-grams is proposed to obtain reliable statistical constraints from a given language corpus. Starting from POS bigrams, the proposed scheme creates variable-order N-grams by splitting a POS into finer groups and by adding frequent consecutive word sequences as word-classes. Word-class splitting and consecutive word grouping are carried out incrementally so as to minimize the total entropy. Experiments showed that the perplexity of the proposed model on the test corpus is lower than that of a conventional trigram, and that the model requires far fewer statistical parameters. Applying this model to speech recognition gives a better recognition rate than conventional bigrams.
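The entropy criterion described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a class-bigram model P(w_i | w_{i-1}) = P(c_i | c_{i-1}) P(w_i | c_i) with unsmoothed maximum-likelihood estimates, measures total entropy on the training text itself, and shows only the word-class splitting step. The names `total_entropy`, `split_word`, and `perplexity`, and the toy corpus, are hypothetical.

```python
import math
from collections import Counter

def total_entropy(words, word2class):
    """Total bits needed to code words[1:] under an ML class-bigram model."""
    classes = [word2class[w] for w in words]
    class_count = Counter(classes)
    word_count = Counter(words)
    bigram_count = Counter(zip(classes, classes[1:]))
    history_count = Counter(classes[:-1])  # classes seen as a left context
    bits = 0.0
    for prev, w, c in zip(classes, words[1:], classes[1:]):
        p_class = bigram_count[(prev, c)] / history_count[prev]
        p_word = word_count[w] / class_count[c]
        bits -= math.log2(p_class * p_word)
    return bits

def split_word(word2class, word):
    """Return a new mapping in which `word` forms a class of its own."""
    new_map = dict(word2class)
    new_map[word] = "WORD:" + word
    return new_map

def perplexity(words, word2class):
    """Per-word perplexity corresponding to the total entropy."""
    return 2 ** (total_entropy(words, word2class) / (len(words) - 1))

# Toy corpus and POS mapping (assumed for illustration only).
words = "the cat sat on the mat the dog sat on the rug".split()
pos = {"the": "DET", "cat": "NOUN", "mat": "NOUN", "dog": "NOUN",
       "rug": "NOUN", "sat": "VERB", "on": "PREP"}

base = total_entropy(words, pos)
after = total_entropy(words, split_word(pos, "cat"))
print(f"POS bigram: {base:.3f} bits, after splitting 'cat': {after:.3f} bits")
```

A greedy loop in this spirit would evaluate each candidate split (or each candidate word sequence to group) and keep the one that lowers total entropy the most. With unsmoothed estimates on training data a split can never raise the entropy, so a practical system would need some stopping rule; the abstract does not say which one is used.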
Keywords :
minimum entropy methods; natural languages; speech recognition; statistical analysis; POS bigrams; consecutive word grouping; consecutive word sequences; language corpus; perplexity; statistical constraints; test corpus; total entropy; variable-order N-gram generation; word-class splitting; Bellows; Data mining; Entropy; History; Probability; Smoothing methods; Statistical distributions; Testing
Conference_Title :
Acoustics, Speech, and Signal Processing, 1996. ICASSP-96. Conference Proceedings., 1996 IEEE International Conference on
Conference_Location :
Atlanta, GA
Print_ISBN :
0-7803-3192-3
DOI :
10.1109/ICASSP.1996.540322