• DocumentCode
    1798347
  • Title

    Analysis of smoothing methods for language models on small Chinese corpora

  • Author

    Ming-Chun Liou ; Feng-Long Huang ; Ming-Shing Yu ; Yih-Jeng Lin

  • Author_Institution
    Dept. of Comput. Sci. & Inf. Eng., Nat. United Univ., MiaoLi, Taiwan
  • Volume
    2
  • fYear
    2014
  • fDate
    13-16 July 2014
  • Firstpage
    499
  • Lastpage
    505
  • Abstract
    Data sparseness is an inherent issue of statistical language models, and smoothing methods have been used to resolve the zero-count problem. Twenty Chinese language models, built from 1M to 20M Chinese words of the CGW corpus, were generated on small corpora, where the zero-count problem is more severe. Five smoothing methods, including Good-Turing and Advanced Good-Turing smoothing as well as our two proposed methods, are evaluated and analyzed on inside testing and outside testing, showing that they alleviate data sparseness across language models of various sizes. The best among these methods is our proposed YH-B, which performs best on all the various models.
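    The abstract's central technique, Good-Turing smoothing, reassigns probability mass from observed n-grams to unseen ones using frequencies of frequencies: the adjusted count is c* = (c+1)·N_{c+1}/N_c, and the mass N_1/N is reserved for zero-count events. A minimal sketch follows; it is not the authors' implementation (their YH-B method is not described here), and the function name and the fall-back to the raw count when N_{c+1} = 0 are assumptions for illustration:

    ```python
    from collections import Counter

    def good_turing_probs(counts):
        """Estimate Good-Turing smoothed probabilities from raw n-gram counts.

        counts: dict mapping n-gram -> raw count c.
        Returns (probs, p_unseen), where p_unseen = N_1 / N is the total
        probability mass reserved for unseen n-grams.
        """
        N = sum(counts.values())                  # total observations
        freq_of_freq = Counter(counts.values())   # N_c: number of types seen c times
        p_unseen = freq_of_freq.get(1, 0) / N     # mass reassigned to zero-count events

        probs = {}
        for gram, c in counts.items():
            n_c = freq_of_freq[c]
            n_c1 = freq_of_freq.get(c + 1, 0)
            if n_c1 > 0:
                c_star = (c + 1) * n_c1 / n_c     # Good-Turing adjusted count
            else:
                c_star = c                        # assumption: fall back to the raw count
            probs[gram] = c_star / N
        return probs, p_unseen
    ```

    For example, with counts {"a": 1, "b": 1, "c": 2}, half the probability mass (N_1/N = 2/4) is reserved for unseen n-grams, which is exactly the zero-count relief the abstract describes. Practical variants (such as the Advanced Good-Turing method the paper evaluates) smooth the N_c sequence itself, since high counts often have N_{c+1} = 0.
    
    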
  • Keywords
    natural language processing; CGW; data sparseness; YH-B; advanced Good-Turing smoothing; language models; small Chinese corpora; smoothing methods; statistical language models; Abstracts; Acoustics; Analytical models; Artificial intelligence; Entropy; Maximum likelihood estimation; Signal resolution; Cross Entropy; Language Models; Perplexity; Smoothing Methods;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Machine Learning and Cybernetics (ICMLC), 2014 International Conference on
  • Conference_Location
    Lanzhou
  • ISSN
    2160-133X
  • Print_ISBN
    978-1-4799-4216-9
  • Type
    conf
  • DOI
    10.1109/ICMLC.2014.7009658
  • Filename
    7009658