DocumentCode
1798347
Title
Analysis of smoothing methods for language models on small Chinese corpora
Author
Ming-Chun Liou ; Feng-Long Huang ; Ming-Shing Yu ; Yih-Jeng Lin
Author_Institution
Dept. of Comput. Sci. & Inf. Eng., Nat. United Univ., Miaoli, Taiwan
Volume
2
fYear
2014
fDate
13-16 July 2014
Firstpage
499
Lastpage
505
Abstract
Data sparseness is an inherent issue of statistical language models, and smoothing methods have been used to resolve the problem of zero counts. Because the zero-count issue is worse on small corpora, 20 Chinese language models were generated from 1M to 20M Chinese words of the CGW corpus. Five smoothing methods, including Good-Turing and advanced Good-Turing smoothing as well as our two proposed methods, are evaluated and analyzed with inside testing and outside testing. The results show that these methods alleviate data sparseness across language models of various sizes, and that our proposed YH-B performs best in all the models.
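The record contains no code; as background for the abstract, the sketch below illustrates the classical Good-Turing estimate that the evaluated methods build on (adjusted count c* = (c+1) * N_{c+1} / N_c, with mass N_1 / N reserved for unseen events). The function name and the fallback for high counts are illustrative assumptions, not the authors' YH-B method.

```python
from collections import Counter

def good_turing_probs(tokens):
    """Classical Good-Turing estimate for a list of observed tokens.

    Uses the count-of-counts N_c (number of distinct items seen exactly
    c times) to compute smoothed counts c* = (c+1) * N_{c+1} / N_c.
    """
    counts = Counter(tokens)        # raw count c for each observed item
    n_c = Counter(counts.values())  # count-of-counts N_c
    total = sum(counts.values())    # total number of tokens N

    # Total probability mass reserved for unseen events: N_1 / N.
    p_unseen = n_c[1] / total

    probs = {}
    for item, c in counts.items():
        if n_c[c + 1] > 0:
            c_star = (c + 1) * n_c[c + 1] / n_c[c]  # smoothed count
        else:
            # Assumed fallback: keep the raw count when N_{c+1} = 0,
            # as happens for the highest-frequency items.
            c_star = c
        probs[item] = c_star / total
    return probs, p_unseen
```

Calling `good_turing_probs` on a token sequence returns per-item probabilities plus the reserved unseen mass, which in an n-gram model would then be divided among the zero-count events.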
Keywords
natural language processing; CGW; Data sparseness; YH-B; advanced Good-Turing smoothing; language models; small Chinese corpora; smoothing methods; statistical language models; Abstracts; Acoustics; Analytical models; Artificial intelligence; Entropy; Maximum likelihood estimation; Signal resolution; Cross Entropy; Language Models; Perplexity; Smoothing Methods
fLanguage
English
Publisher
ieee
Conference_Titel
2014 International Conference on Machine Learning and Cybernetics (ICMLC)
Conference_Location
Lanzhou
ISSN
2160-133X
Print_ISBN
978-1-4799-4216-9
Type
conf
DOI
10.1109/ICMLC.2014.7009658
Filename
7009658