DocumentCode
322759
Title
n-gram estimates in probabilistic models for Pinyin to Hanzi transcription
Author
Lochovsky, Amelia Fong ; Cheung, Hon-Kit
Author_Institution
Dept. of Comput. Sci., Hong Kong Univ. of Sci. & Technol., Hong Kong
Volume
2
fYear
1997
fDate
28-31 Oct 1997
Firstpage
1798
Abstract
We consider the problem of sparse data in probabilistic modeling of the Chinese language. To date, n-gram models outperform models that try to capture linguistic structure. Various techniques for estimating n-gram statistics have been proposed and compared for the English language. How well each technique performs is known to depend on the problem domain in which the probabilistic model is applied. We apply different smoothing techniques to the estimation of bigram statistics in a word-based bigram model for Pinyin to Hanzi transcription. Comparative results are reported and show improved accuracy over the maximum likelihood estimation (MLE) method. We have also experimented with hybrid approaches (using bigrams as well as monograms) to achieve superior results.
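The abstract does not name the smoothing techniques that were compared, so the following is only a minimal sketch of the general idea, not the authors' method. It contrasts an MLE bigram estimate with two commonly used alternatives chosen here as assumptions: add-k smoothing, and linear interpolation of the bigram estimate with a monogram (unigram) estimate as one possible reading of the "hybrid" approach. The toy corpus, the function names, and the values of `k` and `lam` are all hypothetical.

```python
from collections import Counter


def train_counts(sentences):
    """Collect monogram, bigram, and bigram-context counts from segmented sentences."""
    unigrams = Counter()   # count of each word
    bigrams = Counter()    # count of each (previous word, word) pair
    contexts = Counter()   # count of each word used as a bigram's first element
    for words in sentences:
        unigrams.update(words)
        for prev, cur in zip(words, words[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return unigrams, bigrams, contexts


def p_mle(prev, cur, bigrams, contexts):
    """Maximum likelihood estimate; unseen bigrams get probability zero."""
    if contexts[prev] == 0:
        return 0.0
    return bigrams[(prev, cur)] / contexts[prev]


def p_add_k(prev, cur, bigrams, contexts, vocab_size, k=1.0):
    """Add-k smoothing (an assumed technique): every bigram gets k pseudo-counts."""
    return (bigrams[(prev, cur)] + k) / (contexts[prev] + k * vocab_size)


def p_interpolated(prev, cur, unigrams, bigrams, contexts, lam=0.7):
    """Assumed hybrid: weighted mix of the bigram MLE and the monogram MLE."""
    total = sum(unigrams.values())
    p_uni = unigrams[cur] / total if total else 0.0
    return lam * p_mle(prev, cur, bigrams, contexts) + (1.0 - lam) * p_uni


if __name__ == "__main__":
    # Hypothetical, already word-segmented toy corpus.
    corpus = [["我", "是", "学生"], ["我", "是", "老师"], ["他", "是", "学生"]]
    uni, bi, ctx = train_counts(corpus)
    pair = ("是", "他")  # a bigram that never occurs in the toy corpus
    print("MLE:         ", p_mle(*pair, bi, ctx))
    print("add-k:       ", p_add_k(*pair, bi, ctx, vocab_size=len(uni)))
    print("interpolated:", p_interpolated(*pair, uni, bi, ctx))
```

In a Pinyin to Hanzi decoder, a zero MLE probability for an unseen word pair eliminates every candidate sentence containing it, which is why the sparse data problem matters; both smoothed estimates above assign such pairs a small nonzero probability instead.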
Keywords
language translation; natural languages; probability; word processing; Chinese language; English language; Hanzi transcription; MLE method; Pinyin; bigram statistics; hybrid approaches; linguistical structures; monograms; n-gram estimates; probabilistic model; probabilistic modeling; probabilistic models; problem domain; smoothing techniques; sparse data; word based bigram model; Computer science; Equations; Frequency estimation; Information theory; Maximum likelihood estimation; Natural languages; Optical character recognition software; Smoothing methods; Speech recognition; Statistics;
fLanguage
English
Publisher
IEEE
Conference_Titel
Intelligent Processing Systems, 1997. ICIPS '97. 1997 IEEE International Conference on
Conference_Location
Beijing
Print_ISBN
0-7803-4253-4
Type
conf
DOI
10.1109/ICIPS.1997.669366
Filename
669366