• DocumentCode
    2660165
  • Title

    Improving word segmentation for Thai speech translation

  • Author

    Charoenpornsawat, Paisarn ; Schultz, Tanja

  • fYear
    2008
  • fDate
    15-19 Dec. 2008
  • Firstpage
    241
  • Lastpage
    244
  • Abstract
    A vocabulary list and a language model are primary components of a speech translation system. Generating both from plain text is straightforward for English. However, it is quite challenging for languages such as Chinese, Japanese, and Thai, whose written text has no word boundary delimiter and therefore must be segmented into words first. For Thai word segmentation, maximal matching, a lexicon-based approach, is one of the popular methods. Nevertheless, this method relies heavily on the coverage of the lexicon. When the text contains an unknown word, the method usually produces a wrong boundary, and when words are then extracted from the segmented text, some words are not retrieved because of the wrong segmentation. In this paper, we propose statistical techniques to tackle this problem. Based on different word segmentation methods, we develop various speech translation systems and show that the proposed method can significantly improve the translation accuracy by about 6.42% BLEU points compared to the baseline system.
  • Keywords
    feature extraction; language translation; natural language processing; speech recognition; statistical analysis; vocabulary; Thai speech translation; language model; lexicon-based approach; maximal matching; speech recognition; statistical techniques; text segmentation; vocabulary list; word extraction; word segmentation; Automatic speech recognition; Dictionaries; Entropy; Natural language processing; Natural languages; Speech recognition; Surface-mount technology; Text processing; Training data; Vocabulary; Speech Recognition; Spoken language translation; Text Processing; Word Segmentation;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Spoken Language Technology Workshop, 2008. SLT 2008. IEEE
  • Conference_Location
    Goa
  • Print_ISBN
    978-1-4244-3471-8
  • Electronic_ISBN
    978-1-4244-3472-5
  • Type

    conf

  • DOI
    10.1109/SLT.2008.4777885
  • Filename
    4777885
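
The abstract names maximal matching as the lexicon-based baseline and notes that it breaks down on unknown words. The following is a minimal sketch of that idea, not the authors' implementation: it assumes maximal matching means choosing the segmentation with the fewest tokens, and it assumes that any character not covered by a lexicon entry is emitted as a one-character token. The function name `maximal_matching` and the toy Latin-letter lexicons are hypothetical and used only for illustration.

```python
def maximal_matching(text, lexicon):
    """Segment `text` into the fewest tokens using words from `lexicon`.

    Assumption: a character that no lexicon word covers becomes a
    one-character token, which is how unknown words get fragmented.
    """
    n = len(text)
    max_len = max((len(w) for w in lexicon), default=1)

    # cost[i] = fewest tokens needed to segment text[i:]
    # back[i] = end index of the first token in that best segmentation
    cost = [0] * (n + 1)
    back = [n] * (n + 1)

    for i in range(n - 1, -1, -1):
        # Fallback: treat text[i] as a one-character token.
        cost[i] = cost[i + 1] + 1
        back[i] = i + 1
        # Try every lexicon word that starts at position i.
        for j in range(i + 1, min(n, i + max_len) + 1):
            if text[i:j] in lexicon and cost[j] + 1 < cost[i]:
                cost[i] = cost[j] + 1
                back[i] = j

    # Follow the back-pointers to recover the tokens.
    tokens = []
    i = 0
    while i < n:
        tokens.append(text[i:back[i]])
        i = back[i]
    return tokens


# All words known: the fewest-token segmentation is found.
print(maximal_matching("abcdef", {"ab", "abc", "cdef", "d", "e", "f"}))

# "x" is not in the lexicon, so it surfaces as a stray one-character token.
print(maximal_matching("abxcdef", {"ab", "cdef"}))
```

In the second call the unknown material is split off as a fragment rather than recognized as a word, which is the lexicon-coverage problem the abstract describes: a vocabulary extracted from such output would miss the unknown word.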