  • DocumentCode
    1993330
  • Title
    Learning the lexicon from raw texts for open-vocabulary Korean word recognition
  • Author
    Ryu, Sungho; Kim, Jin Hyung
  • Author_Institution
    Division of Comput. Sci., KAIST, Daejon, South Korea
  • fYear
    2003
  • fDate
    3-6 Aug. 2003
  • Firstpage
    202
  • Abstract
    In this paper, we propose a novel method of building a language model for open-vocabulary Korean word recognition. Because of the complex morphology of Korean, lexicons based on linguistic entities such as words and morphemes are inappropriate in open-vocabulary domains. Instead, we build the lexicon by collecting variable-length character sequences from raw texts using a dynamic Bayesian network model of the language. In simulated word recognition experiments, the proposed language model found the correct words from lattices of character candidates in 94.3% of cases, increasing the word recognition rate by 20.9%.
  • Keywords
    character recognition; grammars; text analysis; Bayesian network model; Korean word recognition; eojeols; language model; lexicon learning; morphemes; open-vocabulary word recognition; variable length character sequence; Bayesian methods; Character recognition; Computer science; Context modeling; Electronic mail; Lattices; Morphology; Natural languages; Probability distribution; Text recognition
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition, 2003. Proceedings. Seventh International Conference on
  • Print_ISBN
    0-7695-1960-1
  • Type
    conf
  • DOI
    10.1109/ICDAR.2003.1227659
  • Filename
    1227659