• DocumentCode
    3648282
  • Title
    Improving language models for ASR using translated in-domain data
  • Author
    Stefan Kombrink;Tomáš Mikolov;Martin Karafiát;Lukáš Burget
  • Author_Institution
    Brno University of Technology, Czech Republic
  • fYear
    2012
  • fDate
    3/1/2012 12:00:00 AM
  • Firstpage
    4405
  • Lastpage
    4408
  • Abstract
    Acquiring in-domain training data to build speech recognition systems for under-resourced languages can be a costly, time-consuming and tedious process. In this work, we propose using machine translation to translate English transcripts of telephone speech into Czech in order to improve a Czech conversational telephone speech (CTS) recognition system. The translated transcripts are used as additional language model training data in a scenario where the baseline language model is trained on off- and close-domain data only. We report perplexities, out-of-vocabulary (OOV) rates and word error rates, and examine how suitable different data sets and translators are for the described task.
  • Keywords
    "Data models","Speech","Dictionaries","Google","Speech recognition","Acoustics","Decoding"
  • Publisher
    IEEE
  • Conference_Titel
    Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on
  • ISSN
    1520-6149
  • Print_ISBN
    978-1-4673-0045-2
  • Type
    conf
  • DOI
    10.1109/ICASSP.2012.6288896
  • Filename
    6288896