Language model capitalization

Author

Beaufays, Francoise ; Strope, Brian

Author_Institution

Google, Mountain View, CA, USA

fYear

2013

Firstpage

6749

Lastpage

6752

Abstract

In many speech recognition systems, capitalization is not an inherent component of the language model: training corpora are lower-cased, and counts are accumulated for sequences of lower-cased words. This level of modeling is sufficient for automating voice commands or otherwise enabling users to communicate with a machine, but when the recognized speech is intended to be read by a person, such as in email dictation or even some web search applications, the lack of capitalization in the user's input can place an extra cognitive load on the reader. For these cases, speech recognition systems often post-process the recognized text to restore capitalization. We propose folding capitalization directly into the recognition language model. Instead of post-processing, we take the approach that language should be represented in all its richness, with capitalization, diacritics, and other special symbols. With that perspective, we describe a strategy to handle poorly capitalized or uncapitalized training corpora for language modeling. The resulting recognition system retains the accuracy/latency/memory tradeoff of our uncapitalized production recognizer, while providing properly cased outputs.

Keywords

speech recognition; text detection; automating voice commands; language model capitalization; lower-cased words; speech recognition systems; training corpora; Accuracy; Data models; Error analysis; Speech; Speech recognition; Training; Training data; Capitalization; FST; language modeling

fLanguage

English

Publisher

IEEE

Conference_Titel

Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on

Conference_Location

Vancouver, BC

ISSN

1520-6149

Type

conf

DOI

10.1109/ICASSP.2013.6638968

Filename

6638968