• DocumentCode
    353716
  • Title

    A unified approach to statistical language modeling for Chinese

  • Author

    Gao, Jianfeng ; Wang, Hai-Feng ; Li, Mingjing ; Lee, Kai-Fu

  • Author_Institution
    Microsoft Res., Beijing, China
  • Volume
    3
  • fYear
    2000
  • fDate
    2000
  • Firstpage
    1703
  • Abstract
    The paper presents a unified approach to Chinese statistical language modeling (SLM). Applying SLM techniques like trigrams to Chinese is challenging because: (1) there is no standard definition of words in Chinese, (2) word boundaries are not marked by spaces, and (3) there is a dearth of training data. Our unified approach automatically and consistently gathers a high-quality training data set from the Web, creates a high-quality lexicon, and segments the training data using this lexicon, all using a maximum likelihood principle, which is consistent with the trigram training. We show that each of the methods leads to improvements over standard SLM, and that the combined method yields the best pinyin conversion result reported.
  • Keywords
    computational linguistics; information resources; maximum likelihood estimation; natural languages; word processing; Chinese; SLM techniques; Web; high-quality lexicon; maximum likelihood principle; pinyin conversion result; standard definition; statistical language modeling; training data set; trigram training; trigrams; unified approach; word boundaries; Dictionaries; Filtering algorithms; Filters; Information retrieval; Maximum likelihood estimation; Natural languages; Optimization methods; Parameter estimation; Speech recognition; Training data;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Acoustics, Speech, and Signal Processing, 2000. ICASSP '00. Proceedings. 2000 IEEE International Conference on
  • Conference_Location
    Istanbul
  • ISSN
    1520-6149
  • Print_ISBN
    0-7803-6293-4
  • Type

    conf

  • DOI
    10.1109/ICASSP.2000.862079
  • Filename
    862079