DocumentCode
3531045
Title
Resampling auxiliary data for language model adaptation in machine translation for speech
Author
Maskey, Sameer ; Sethy, Abhinav
Author_Institution
IBM T.J. Watson Res. Center, New York, NY
fYear
2009
fDate
19-24 April 2009
Firstpage
4817
Lastpage
4820
Abstract
Performance of n-gram language models depends to a large extent on the amount of training text material available for building the models and the degree to which this text matches the domain of interest. The language modeling community is showing a growing interest in using large collections of auxiliary textual material to supplement sparse in-domain resources. One of the problems in using such auxiliary corpora is that they may differ significantly from the specific nature of the domain of interest. In this paper, we propose three different methods for adapting language models for a speech-to-speech (S2S) translation system when auxiliary corpora are of a different genre and domain. The proposed methods are based on centroid similarity, n-gram ratios, and resampled language models. We show how these methods can be used to select out-of-domain textual data such as newswire text to improve an S2S system. We were able to achieve an overall relative improvement of 3.8% in BLEU score over a baseline system that uses only in-domain conversational data.
Keywords
language translation; speech processing; auxiliary data resampling; language model adaptation; language modeling community; machine translation; n-gram language models; speech to speech translation system; Adaptation model; Entropy; Materials testing; Natural languages; Performance gain; Speech coding; Support vector machine classification; Support vector machines; System testing; Text categorization; Domain Adaptation; Language Model Adaptation; Machine Translation
fLanguage
English
Publisher
ieee
Conference_Titel
Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on
Conference_Location
Taipei
ISSN
1520-6149
Print_ISBN
978-1-4244-2353-8
Electronic_ISBN
1520-6149
Type
conf
DOI
10.1109/ICASSP.2009.4960709
Filename
4960709