• DocumentCode
    2018322
  • Title
    Large vocabulary Uyghur continuous speech recognition based on stems and suffixes
  • Author
    Li, Xin ; Cai, Shang ; Pan, Jielin ; Yan, Yonghong ; Yang, Yafei
  • Author_Institution
    THINKIT Speech Lab., Chinese Acad. of Sci., Beijing, China
  • fYear
    2010
  • fDate
    Nov. 29 2010-Dec. 3 2010
  • Firstpage
    220
  • Lastpage
    223
  • Abstract
    In this paper, we study the vocabulary design problem in Uyghur large vocabulary continuous speech recognition (LVCSR). Uyghur is an agglutinative language in which words are formed by attaching several suffixes to a stem. As a result, the number of word types in Uyghur is effectively unbounded. If the word is used as the recognition unit, the out-of-vocabulary (OOV) rate remains very high even with typical vocabulary sizes of 60k to 100k words. To avoid this problem, we split words into stems and suffixes and use these sub-words as the recognition units. Speech recognition experiments are performed on two test sets, one consisting of sentences from books and the other of sentences from conversations. Compared with the 80k-word baseline, the use of stems and suffixes substantially lowers the OOV rate, and the best system reduces the word error rate (WER) from 46.5% to 44.5% on the book test set and from 57.6% to 47.5% on the conversation test set.
  • Keywords
    natural language processing; speech recognition; text analysis; vocabulary; agglutinative language; continuous speech recognition; conversation sentences test; large vocabulary Uyghur; out-of-vocabulary rate; recognition unit; stems; suffixes; vocabulary design problem; word error rate; word types; Acoustics; Books; Databases; Hidden Markov models; Speech; Speech recognition; Vocabulary; Agglutinative language; Stems and suffixes based language model; Uyghur large vocabulary continuous speech recognition
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Chinese Spoken Language Processing (ISCSLP), 2010 7th International Symposium on
  • Conference_Location
    Tainan
  • Print_ISBN
    978-1-4244-6244-5
  • Type
    conf
  • DOI
    10.1109/ISCSLP.2010.5684909
  • Filename
    5684909
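
The figures quoted in the abstract can be put in relative terms with a few lines of arithmetic. The sketch below is only an illustration, not the authors' evaluation code: the function names `oov_rate` and `relative_reduction` are hypothetical, and the tokens `ab`, `+c`, `+d` are made-up placeholders rather than real Uyghur stems or suffixes. Only the four WER values come from the abstract.

```python
def oov_rate(tokens, vocabulary):
    """Fraction of running tokens that are not in the vocabulary."""
    if not tokens:
        return 0.0
    missing = sum(1 for token in tokens if token not in vocabulary)
    return missing / len(tokens)


def relative_reduction(baseline, improved):
    """Relative reduction of an error rate with respect to the baseline."""
    return (baseline - improved) / baseline


# Placeholder data (assumption): three word forms sharing the stem "ab".
word_tokens = ["ab", "abc", "abd"]
word_vocab = {"ab"}  # a word-level vocabulary that misses the inflected forms

# The same text split into a stem plus suffix units.
subword_tokens = ["ab", "ab", "+c", "ab", "+d"]
subword_vocab = {"ab", "+c", "+d"}

word_oov = oov_rate(word_tokens, word_vocab)
subword_oov = oov_rate(subword_tokens, subword_vocab)

# WER values reported in the abstract (percent).
book_gain = relative_reduction(46.5, 44.5)
conversation_gain = relative_reduction(57.6, 47.5)

print(f"word-level OOV rate:     {word_oov:.3f}")
print(f"sub-word OOV rate:       {subword_oov:.3f}")
print(f"book test set:           {book_gain:.1%} relative WER reduction")
print(f"conversation test set:   {conversation_gain:.1%} relative WER reduction")
```

Under these assumptions the sub-word vocabulary covers every token in the toy text while the word vocabulary does not, which mirrors the motivation given in the abstract. The reported WER changes correspond to roughly 4.3% relative reduction on the book test set and roughly 17.5% on the conversation test set.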