• DocumentCode
    322759
  • Title
    n-gram estimates in probabilistic models for Pinyin to Hanzi transcription
  • Author
    Lochovsky, Amelia Fong; Cheung, Hon-Kit
  • Author_Institution
    Dept. of Comput. Sci., Hong Kong Univ. of Sci. & Technol., Hong Kong
  • Volume
    2
  • fYear
    1997
  • fDate
    28-31 Oct 1997
  • Firstpage
    1798
  • Abstract
    We consider the problem of sparse data in probabilistic modeling of the Chinese language. To date, n-gram models outperform models that try to capture linguistic structure. Various techniques for estimating n-gram statistics have been proposed and compared for the English language. How well each technique performs is known to depend on the problem domain in which the probabilistic model is applied. We apply different smoothing techniques to the estimation of bigram statistics in a word-based bigram model for Pinyin-to-Hanzi transcription. Comparative results are reported and show improved accuracy over the maximum likelihood estimation (MLE) method. We have also experimented with hybrid approaches (using bigrams as well as monograms) to achieve superior results.
  • Keywords
    language translation; natural languages; probability; word processing; Chinese language; English language; Hanzi transcription; MLE method; Pinyin; bigram statistics; hybrid approaches; linguistical structures; monograms; n-gram estimates; probabilistic model; probabilistic modeling; probabilistic models; problem domain; smoothing techniques; sparse data; word based bigram model; Computer science; Equations; Frequency estimation; Information theory; Maximum likelihood estimation; Natural languages; Optical character recognition software; Smoothing methods; Speech recognition; Statistics
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Intelligent Processing Systems, 1997. ICIPS '97. 1997 IEEE International Conference on
  • Conference_Location
    Beijing
  • Print_ISBN
    0-7803-4253-4
  • Type
    conf
  • DOI
    10.1109/ICIPS.1997.669366
  • Filename
    669366