• DocumentCode
    2403799
  • Title
    Text conditioning and statistical language modeling for Romanian language
  • Author
    Domokos, Jozsef ; Toderean, Gavril ; Buza, Ovidiu
  • Author_Institution
    Commun. Dept., Tech. Univ. of Cluj-Napoca, Cluj-Napoca, Romania
  • fYear
    2009
  • fDate
    18-21 June 2009
  • Firstpage
    1
  • Lastpage
    5
  • Abstract
    This paper presents a synthesis of the theoretical fundamentals and some practical aspects of statistical (n-gram) language modeling, a main component of a large-vocabulary statistical speech recognition system. Unigram, bigram, and trigram language models are presented, along with the Katz back-off smoothing algorithm based on the Good-Turing estimator. The perplexity measure used to evaluate a language model is also described. The practical experiments were carried out on a Romanian constitution corpus, and the text normalization steps applied before language model generation are presented. The results are language models for the Romanian language in ARPA-MIT format. The models were tested and compared using the perplexity measure. Finally, some comparisons are made between Romanian and English language modeling, and conclusions are drawn.
  • Keywords
    hidden Markov models; natural language processing; smoothing methods; speech recognition; speech synthesis; statistical analysis; text analysis; vocabulary; ARPA-MIT format language model; Good-Turing estimator; Katz back-off smoothing algorithm; Romanian constitution corpus; Romanian language; hidden Markov model; perplexity measure; statistical language modeling; text conditioning tool; vocabulary statistical speech recognition system; Constitution; Context modeling; History; Natural languages; Power system modeling; ARPA-MIT language model format; Romanian statistical language modeling; n-gram language modeling; perplexity; smoothing; text conditioning
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Speech Technology and Human-Computer Dialogue, 2009. SpeD '09. Proceedings of the 5-th Conference on
  • Conference_Location
    Constant
  • Print_ISBN
    978-1-4244-4727-5
  • Type
    conf
  • DOI
    10.1109/SPED.2009.5156184
  • Filename
    5156184