n-gram estimates in probabilistic models for Pinyin to Hanzi transcription

Author

Lochovsky, Amelia Fong ; Cheung, Hon-Kit

Author_Institution

Dept. of Comput. Sci., Hong Kong Univ. of Sci. & Technol., Hong Kong

Volume

2

fYear

1997

fDate

28-31 Oct 1997

Firstpage

1798

Abstract

We consider the problem of sparse data in probabilistic modeling of the Chinese language. To date, n-gram models outperform models that try to capture linguistic structure. Various techniques for estimating n-gram statistics have been proposed and compared for the English language, but how well each technique actually performs is known to depend on the problem domain in which the probabilistic model is applied. We apply different smoothing techniques to the estimation of bigram statistics in a word-based bigram model for Pinyin-to-Hanzi transcription. Comparative results are reported and show improved accuracy over the maximum likelihood estimation (MLE) method. We have also experimented with hybrid approaches, which use monograms (single-word statistics) as well as bigrams, to achieve superior results.
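
The abstract does not say which smoothing techniques or which hybrid scheme were evaluated, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes add-one (Laplace) smoothing as a stand-in smoothing technique and simple linear interpolation of bigram and monogram estimates as a stand-in hybrid approach. The toy corpus, the function names, and the weight `lam` are all hypothetical.

```python
from collections import Counter

def train_counts(sentences):
    """Count words, bigram histories, and bigrams over segmented sentences."""
    unigrams, histories, bigrams = Counter(), Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        histories.update(words[:-1])          # words that have a successor
        bigrams.update(zip(words, words[1:]))
    return unigrams, histories, bigrams

def p_mle(prev, word, histories, bigrams):
    """Maximum likelihood estimate: zero for every unseen bigram."""
    if histories[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / histories[prev]

def p_add_one(prev, word, unigrams, histories, bigrams):
    """Add-one smoothing (assumed here): every bigram gets one extra count."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (histories[prev] + vocab_size)

def p_hybrid(prev, word, unigrams, histories, bigrams, lam=0.7):
    """Linear interpolation (assumed here) of bigram and monogram estimates."""
    p_uni = unigrams[word] / sum(unigrams.values())
    return lam * p_mle(prev, word, histories, bigrams) + (1 - lam) * p_uni

# Hypothetical toy corpus of word-segmented Hanzi sentences.
corpus = [["我", "是", "学生"], ["他", "是", "老师"], ["我", "有", "事"]]
unigrams, histories, bigrams = train_counts(corpus)

# The bigram (他, 有) never occurs in the corpus: MLE assigns it zero,
# while the smoothed and hybrid estimates keep it possible.
print(p_mle("他", "有", histories, bigrams))
print(p_add_one("他", "有", unigrams, histories, bigrams))
print(p_hybrid("他", "有", unigrams, histories, bigrams))
```

In a transcription setting, a zero MLE estimate would rule out any candidate Hanzi word sequence containing an unseen bigram, however plausible the rest of the sequence is. This is the sparse-data problem that smoothed and hybrid estimates are meant to address.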

Keywords

language translation; natural languages; probability; word processing; Chinese language; English language; Hanzi transcription; MLE method; Pinyin; bigram statistics; hybrid approaches; linguistical structures; monograms; n-gram estimates; probabilistic model; probabilistic modeling; probabilistic models; problem domain; smoothing techniques; sparse data; word based bigram model; Computer science; Equations; Frequency estimation; Information theory; Maximum likelihood estimation; Natural languages; Optical character recognition software; Smoothing methods; Speech recognition; Statistics

fLanguage

English

Publisher

IEEE

Conference_Titel

Intelligent Processing Systems, 1997. ICIPS '97. 1997 IEEE International Conference on

Conference_Location

Beijing

Print_ISBN

0-7803-4253-4

Type

conf

DOI

10.1109/ICIPS.1997.669366

Filename

669366