DocumentCode
2449469
Title
Joint n-gram Chinese language modeling with an application to Chinese word segmentation
Author
He, Xin ; Ou, Zhijian ; Sun, Jiasong
Author_Institution
Dept. of Electron. Eng., Tsinghua Univ., Beijing, China
fYear
2012
fDate
16-18 July 2012
Firstpage
319
Lastpage
323
Abstract
State-of-the-art language models (LMs) are n-gram models, which, for Chinese, are word-based n-grams. Constructing Chinese word-based n-gram LMs requires a lexicon and a Chinese word segmentation (CWS) step. However, there is no standard definition of a word in Chinese, and new words can always be constructed by combining multiple characters, which causes out-of-vocabulary (OOV) problems. These issues make lexicon definition and CWS difficult and ill-defined, which degrades the quality of Chinese LMs. Recently, conditional random fields (CRFs) have been shown to perform robust and accurate CWS, especially in recalling OOV words. However, they are in essence not Chinese language models, but conditional models of the position-of-character (POC) tag sequence given the character sequence. In this paper, we propose a new Chinese language model, the joint n-gram, which incorporates the POC tags so that no lexicon is needed. It is a truly generative model of Chinese sentences. The effectiveness of the new LM is shown in terms of perplexity and CWS performance.
Keywords
natural language processing; word processing; CRF; CWS; Chinese sentences; Chinese word segmentation; Chinese word-based n-gram LMs; OOV problems; OOV words; POC tag-sequence; conditional random fields; joint n-gram Chinese language modeling; lexicon definition; out-of-vocabulary problems; position-of-character tag-sequence; state-of-the-art language models; Computational modeling; Hidden Markov models; Joints; Robustness; Speech recognition; Standards; Tagging;
fLanguage
English
Publisher
IEEE
Conference_Titel
Audio, Language and Image Processing (ICALIP), 2012 International Conference on
Conference_Location
Shanghai
Print_ISBN
978-1-4673-0173-2
Type
conf
DOI
10.1109/ICALIP.2012.6376633
Filename
6376633