• DocumentCode
    2449469
  • Title
    Joint n-gram Chinese language modeling with an application to Chinese word segmentation
  • Author
    He, Xin ; Ou, Zhijian ; Sun, Jiasong
  • Author_Institution
    Dept. of Electron. Eng., Tsinghua Univ., Beijing, China
  • fYear
    2012
  • fDate
    16-18 July 2012
  • Firstpage
    319
  • Lastpage
    323
  • Abstract
    The state-of-the-art language models (LMs) are n-gram models, which, for Chinese, are word-based n-grams. To construct Chinese word-based n-gram LMs, we need a lexicon and a Chinese word segmentation (CWS) step. However, there is no standard definition of a word in Chinese, and it is always possible to construct new words by combining multiple characters, which causes out-of-vocabulary (OOV) problems. These issues make lexicon definition and CWS difficult and ill-defined, which degrades the quality of Chinese LMs. Recently, conditional random fields (CRFs) have been shown to perform robust and accurate CWS, especially in recalling OOV words. However, CRFs are in essence not Chinese language models but conditional models of the position-of-character (POC) tag sequence given the character sequence. In this paper, we propose a new Chinese language model, the joint n-gram, which incorporates the POC tags so that no lexicon is needed. It is a truly generative model of Chinese sentences. The effectiveness of the new LM is shown in terms of perplexity and CWS performance.
  • Keywords
    natural language processing; word processing; CRF; CWS; Chinese sentences; Chinese word segmentation; Chinese word-based n-gram LMs; OOV problems; OOV words; POC tag-sequence; conditional random fields; joint n-gram Chinese language modeling; lexicon definition; out-of-vocabulary problems; position-of-character tag-sequence; state-of-the-art language models; Computational modeling; Hidden Markov models; Joints; Robustness; Speech recognition; Standards; Tagging;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Audio, Language and Image Processing (ICALIP), 2012 International Conference on
  • Conference_Location
    Shanghai
  • Print_ISBN
    978-1-4673-0173-2
  • Type
    conf
  • DOI
    10.1109/ICALIP.2012.6376633
  • Filename
    6376633
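
The abstract describes a language model over characters paired with position-of-character (POC) tags, so that word boundaries are carried by the tags rather than by a lexicon. The sketch below illustrates that idea only; it is not the authors' implementation. It assumes a four-tag scheme (B, M, E, S for begin, middle, end, single), which the abstract does not specify, and the function names `to_joint_tokens`, `from_joint_tokens`, and `bigram_counts` are hypothetical. It counts bigrams over (character, tag) pairs and leaves out smoothing and decoding.

```python
from collections import Counter

def to_joint_tokens(words):
    """Turn a segmented sentence into (character, POC tag) pairs.

    Assumed tag set: B (begin), M (middle), E (end), S (single-character word).
    """
    tokens = []
    for word in words:
        if len(word) == 1:
            tokens.append((word, "S"))
        else:
            tags = ["B"] + ["M"] * (len(word) - 2) + ["E"]
            tokens.extend(zip(word, tags))
    return tokens

def from_joint_tokens(tokens):
    """Recover the word segmentation from the POC tags alone."""
    words, current = [], ""
    for char, tag in tokens:
        current += char
        if tag in ("E", "S"):  # these tags close a word
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words

def bigram_counts(sentences):
    """Count unigrams and bigrams over joint (character, tag) tokens."""
    start, end = ("<s>", "<s>"), ("</s>", "</s>")
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        tokens = [start] + to_joint_tokens(words) + [end]
        unigrams.update(tokens[:-1])  # history counts
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

corpus = [["清华", "大学", "在", "北京"]]
unigrams, bigrams = bigram_counts(corpus)
joint = to_joint_tokens(corpus[0])
```

Because each token carries its own tag, a sequence of joint tokens determines both the sentence and its segmentation, and no word list is consulted at any point.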