• DocumentCode
    3648954
  • Title
    A language model for highly inflective non-agglutinative languages
  • Author
    Stevan Ostrogonac;Dragiša Mišković;Milan Sečujski;Darko Pekar;Vlado Delić
  • Author_Institution
    Faculty of Technical Sciences, University of Novi Sad, Serbia
  • fYear
    2012
  • Firstpage
    177
  • Lastpage
    181
  • Abstract
    This paper proposes a method of creating language models for highly inflective non-agglutinative languages. Three types of language models were considered: a common n-gram model, an n-gram model of lemmas, and a class n-gram model. The last two types were designed specifically for the Serbian language, reflecting its unique grammatical structure. All the language models were trained on a carefully collected data set incorporating several literary styles and a wide variety of domain-specific textual documents in Serbian. Language models of the three types were created for different sets of textual corpora and evaluated by their perplexity on the test data. A log-linear combination of the common, lemma-based and class n-gram models was also created and shows promising results in overcoming the data sparsity problem. However, this combined model has yet to be evaluated within a large vocabulary continuous speech recognition (LVCSR) system to establish the improvement in terms of word error rate (WER).
  • Keywords
    "Data models","Training","Mathematical model","Computational modeling","Vocabulary","Speech recognition","Natural languages"
  • Publisher
    IEEE
  • Conference_Titel
    Intelligent Systems and Informatics (SISY), 2012 IEEE 10th Jubilee International Symposium on
  • Print_ISBN
    978-1-4673-4751-8
  • Type
    conf
  • DOI
    10.1109/SISY.2012.6339510
  • Filename
    6339510