Improving language models for ASR using translated in-domain data

Author

Stefan Kombrink;Tomáš Mikolov;Martin Karafiát;Lukáš Burget

Author_Institution

Brno University of Technology, Czech Republic

fYear

2012

fDate

3/1/2012

Firstpage

4405

Lastpage

4408

Abstract

Acquiring in-domain training data to build speech recognition systems for under-resourced languages can be costly, time-consuming, and tedious. In this work, we propose using machine translation to translate English transcripts of telephone speech into Czech in order to improve a Czech conversational telephone speech (CTS) recognition system. The translated transcripts serve as additional language model training data in a scenario where the baseline language model is trained only on out-of-domain and close-domain data. We report perplexities, out-of-vocabulary (OOV) rates, and word error rates, and examine how suitable different data sets and translators are for this task.

Keywords

"Data models","Speech","Dictionaries","Google","Speech recognition","Acoustics","Decoding"

Publisher

IEEE

Conference_Titel

Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on

ISSN

1520-6149

Print_ISBN

978-1-4673-0045-2

Type

conf

DOI

10.1109/ICASSP.2012.6288896

Filename

6288896

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=3648282