Title :
Modeling characters versus words for Mandarin speech recognition
Author :
Luo, Jun ; Lamel, Lori ; Gauvain, Jean-Luc
Author_Institution :
Spoken Language Process. Group, CNRS-LIMSI, Orsay
Abstract :
Word-based models are widely used in speech recognition since they typically perform well. However, the question of whether it is better to use a word-based or a character-based model warrants being addressed for the Mandarin Chinese language. Since Chinese is written without any spaces or word delimiters, a word segmentation algorithm is applied in a pre-processing step prior to training a word-based language model. Chinese characters carry meaning, and speakers are free to combine characters to construct new words. This suggests that character information can also be useful in communication. This paper explores both word-based and character-based models, and their complementarity. Although word-based modeling is found to outperform character-based modeling, increasing the vocabulary size from 56k to 160k words did not lead to a gain in performance. Results are reported for the GALE Mandarin speech-to-text task.
Keywords :
natural language processing; speech recognition; text analysis; Mandarin Chinese language; Mandarin speech recognition; characters modeling; speech-to-text task; words modeling; Contracts; Humans; Natural languages; Particle separators; Performance gain; TV broadcasting; Technological innovation; Text recognition; Vocabulary; Mandarin Chinese; language modeling; speech-to-text transcription
Conference_Title :
Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on
Conference_Location :
Taipei
Print_ISBN :
978-1-4244-2353-8
Electronic_ISBN :
1520-6149
DOI :
10.1109/ICASSP.2009.4960586