Adaptive language modeling with varied sources to cover new vocabulary items

Author

Schwarm, Sarah E. ; Bulyko, Ivan ; Ostendorf, Mari

Author_Institution

Dept. of Comput. Sci. & Eng., Univ. of Washington, Seattle, WA, USA

Volume

12

Issue

3

fYear

2004

fDate

5/1/2004 12:00:00 AM

Firstpage

334

Lastpage

342

Abstract

N-gram language modeling typically requires large quantities of in-domain training data, i.e., data that matches the task in both topic and style. For conversational speech applications, particularly meeting transcription, obtaining large volumes of speech transcripts is often unrealistic; topics change frequently and collecting conversational-style training data is time-consuming and expensive. In particular, new topics introduce new vocabulary items which are not included in existing models. In this work, we use a variety of data sources (reflecting different sizes and styles), combined using mixture n-gram models. We study the impact of the different sources on vocabulary expansion and recognition accuracy, and investigate possible indicators of the usefulness of a data source. For the task of recognizing meeting speech, we obtain a 9% relative reduction in the overall word error rate and a 61% relative reduction in the word error rate for "new" words added to the vocabulary over a baseline language model trained from general conversational speech data.

Keywords

natural languages; speech processing; speech recognition; vocabulary; word processing; adaptive language modeling; conversational speech; conversational-style training data; data sources; meeting transcription; mixture models; n-gram language modeling; speech transcripts; text normalization; vocabulary expansion; vocabulary items; vocabulary recognition; word error rate; Automatic speech recognition; Computer science; Ear; Error analysis; Testing; Training data; Web sites

fLanguage

English

Journal_Title

Speech and Audio Processing, IEEE Transactions on

Publisher

IEEE

ISSN

1063-6676

Type

jour

DOI

10.1109/TSA.2004.825666

Filename

1288159