Title :
Ergodic multigram HMM integrating word segmentation and class tagging for Chinese language modeling
Author :
Law, Hubert Hin-Cheung ; Chan, Chorkin
Author_Institution :
Dept. of Comput. Sci., Hong Kong Univ., Hong Kong
Abstract :
A novel ergodic multigram hidden Markov model (HMM) is introduced that models sentence production as a doubly stochastic process: word classes are first produced according to a first-order Markov model, and single- or multi-character words are then generated independently from those classes, with no word boundaries marked in the sentence. The model can therefore be applied to languages without word boundary markers, such as Chinese. Given a lexicon that lists the syntactic classes of each word, its applications include language modeling for recognizers and integrated word segmentation and class tagging. A pre-segmented and tagged corpus is not needed for training, and both segmentation and tagging are trained in a single model. In this paper, relevant algorithms for the model are presented, and experimental results on a Chinese news corpus are reported.
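The generative story in the abstract (a first-order Markov chain over word classes, with each class emitting a word of one or more characters) suggests a Viterbi-style search over all segmentations and class sequences at once. The following is a minimal sketch under stated assumptions, not the authors' published algorithm: the function name `viterbi_segment_tag`, the dictionary-based parameter layout, the `max_len` cap on word length, and the toy Latin-letter lexicon with its probabilities are all hypothetical and chosen only for illustration.

```python
import math

def viterbi_segment_tag(sentence, lexicon, start, trans, emit, max_len=4):
    """Return the most probable list of (word, class) pairs, or None.

    Hypothetical parameter layout (not taken from the paper):
      lexicon: word -> iterable of allowed classes
      start:   class -> P(class at sentence start)
      trans:   (prev_class, class) -> P(class | prev_class)
      emit:    (class, word) -> P(word | class)
    """
    n = len(sentence)
    # best[t][c] = (log-prob, start index of last word, previous class)
    # for the best path that covers sentence[:t] and ends in class c.
    best = [dict() for _ in range(n + 1)]
    for t in range(1, n + 1):
        for length in range(1, min(max_len, t) + 1):
            s = t - length
            word = sentence[s:t]
            for c in lexicon.get(word, ()):
                e = emit.get((c, word), 0.0)
                if e <= 0.0:
                    continue
                if s == 0:
                    # First word of the sentence: use the start distribution.
                    p = start.get(c, 0.0)
                    candidates = [(math.log(p) + math.log(e), None)] if p > 0.0 else []
                else:
                    # Extend every surviving path that ends exactly at position s.
                    candidates = [
                        (prev_score + math.log(trans[(pc, c)]) + math.log(e), pc)
                        for pc, (prev_score, _, _) in best[s].items()
                        if trans.get((pc, c), 0.0) > 0.0
                    ]
                for score, pc in candidates:
                    if c not in best[t] or score > best[t][c][0]:
                        best[t][c] = (score, s, pc)
    if not best[n]:
        return None  # no segmentation is possible with this lexicon
    # Backtrace from the best final class.
    c = max(best[n], key=lambda k: best[n][k][0])
    t, path = n, []
    while t > 0:
        _, s, pc = best[t][c]
        path.append((sentence[s:t], c))
        t, c = s, pc
    return path[::-1]

# Toy model (hypothetical values): two classes, N and V.
lexicon = {"a": ["N"], "b": ["N"], "ab": ["N"], "c": ["V"], "bc": ["V"]}
start = {"N": 0.8, "V": 0.2}
trans = {("N", "N"): 0.3, ("N", "V"): 0.7, ("V", "N"): 0.6, ("V", "V"): 0.4}
emit = {("N", "a"): 0.3, ("N", "b"): 0.2, ("N", "ab"): 0.5,
        ("V", "c"): 0.6, ("V", "bc"): 0.4}
result = viterbi_segment_tag("abc", lexicon, start, trans, emit)
```

In this toy case the three candidate analyses are a|b|c, ab|c, and a|bc; the search keeps one best partial path per (position, class) pair, so the cost grows with sentence length times `max_len` times the square of the number of classes rather than with the number of segmentations.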
Keywords :
hidden Markov models; natural languages; speech recognition; stochastic processes; Chinese language modeling; Chinese news corpus; class tagging; doubly stochastic process; ergodic multigram HMM; hidden Markov model; lexicon; multi-character words; sentence production; single character words; syntactic classes; word segmentation; Computer science; Hidden Markov models; Lattices; Maximum likelihood decoding; Natural languages; Production; Stochastic processes; Tagging; Terminology; Viterbi algorithm;
Conference_Title :
Acoustics, Speech, and Signal Processing, 1996. ICASSP-96. Conference Proceedings., 1996 IEEE International Conference on
Conference_Location :
Atlanta, GA
Print_ISBN :
0-7803-3192-3
DOI :
10.1109/ICASSP.1996.540324