Title :
Statistics-based segment pattern lexicon: a new direction for Chinese language modeling
Author :
Yang, Kae-Cherng ; Ho, Tai-Hsuan ; Chien, Lee-Feng ; Lee, Lin-shan
Author_Institution :
Dept. of Electr. Eng., Nat. Taiwan Univ., Taipei, Taiwan
Abstract :
This paper presents a new direction for Chinese language modeling based on a different concept of the lexicon. Every Chinese character has its own meaning, Chinese sentences contain no “blanks” serving as word boundaries, and the wording structure of the Chinese language is extremely flexible. As a result, the “words” in Chinese are not well defined, and no commonly accepted lexicon exists. This makes language modeling for Chinese very complicated and the “out of vocabulary (OOV)” problem especially serious. A new concept of the lexicon is therefore proposed. The elements of this lexicon can be words or any other “segment patterns”. They should be extracted from the training corpus by statistical approaches with the goal of minimizing the overall perplexity. The language models can then be developed based on this new lexicon. Very encouraging experimental results have been obtained.
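The idea of choosing lexicon elements so that perplexity goes down can be illustrated with a small Python sketch. It is not the procedure used in the paper: it assumes a unigram model over lexicon elements, greedy longest-match segmentation, and training-set perplexity measured per character, and the names `segment`, `char_perplexity`, and `grow_lexicon` are hypothetical helpers introduced only for this illustration.

```python
import math
from collections import Counter


def segment(sentence, lexicon, max_len=4):
    """Greedy longest-match segmentation (an assumption of this sketch).

    Single characters are always allowed, so every sentence can be segmented.
    """
    out, i = [], 0
    while i < len(sentence):
        for n in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + n]
            if n == 1 or piece in lexicon:
                out.append(piece)
                i += n
                break
    return out


def char_perplexity(corpus, lexicon, max_len=4):
    """Per-character perplexity of a unigram model over lexicon elements.

    Probabilities are maximum-likelihood estimates from the same corpus,
    so this is a training-set perplexity, used here only for illustration.
    """
    segmented = [segment(s, lexicon, max_len) for s in corpus]
    counts = Counter(tok for sent in segmented for tok in sent)
    total = sum(counts.values())
    log_prob = sum(math.log2(counts[tok] / total)
                   for sent in segmented for tok in sent)
    n_chars = sum(len(s) for s in corpus)
    return 2 ** (-log_prob / n_chars)


def grow_lexicon(corpus, max_len=4, rounds=10):
    """Greedily add the most frequent adjacent-segment pair that lowers perplexity."""
    lexicon = set()
    best = char_perplexity(corpus, lexicon, max_len)
    for _ in range(rounds):
        # Candidate patterns: concatenations of adjacent current segments.
        pairs = Counter()
        for s in corpus:
            toks = segment(s, lexicon, max_len)
            for a, b in zip(toks, toks[1:]):
                if len(a) + len(b) <= max_len:
                    pairs[a + b] += 1
        improved = False
        for cand, _count in pairs.most_common():
            pp = char_perplexity(corpus, lexicon | {cand}, max_len)
            if pp < best:
                lexicon.add(cand)
                best = pp
                improved = True
                break
        if not improved:
            break
    return lexicon, best
```

On the toy corpus `["ABAB", "ABAB"]`, the character-only lexicon gives a per-character perplexity of 2, and admitting the segment pattern `"AB"` reduces it to 1. A realistic setup would estimate perplexity on held-out text and use a higher-order language model.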
Keywords :
feature extraction; natural languages; parameter estimation; speech processing; speech recognition; statistical analysis; Chinese character; Chinese language modeling; experimental results; forward-backward training algorithm; language models; large vocabulary speech recognition; out of vocabulary problem; perplexity minimisation; segment pattern extraction; sentence segmentation; statistics-based segment pattern lexicon; training corpus; wording structure; Character generation; Computer science; Context modeling; Decoding; Information science; Natural languages; Power generation; Speech recognition; Vocabulary
Conference_Title :
Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing
Conference_Location :
Seattle, WA
Print_ISBN :
0-7803-4428-6
DOI :
10.1109/ICASSP.1998.674394