Author_Institution :
IBM T. J. Watson Res. Center, Yorktown Heights, NY, USA
Abstract :
Over the past decade or so, several advances have been made in the design of modern large-vocabulary continuous speech recognition (LVCSR) systems, to the point where their application has broadened from early speaker-dependent dictation systems to speaker-independent automatic broadcast news transcription and indexing, lecture and meeting transcription, conversational telephone speech transcription, open-domain voice search, medical and legal speech recognition, and call center applications, to name a few. The commercial success of these systems is an impressive testimony to how far research in LVCSR has come, and the aim of this article is to describe some of the technological underpinnings of modern systems. It must be said, however, that despite the commercial success and widespread adoption, the problem of large-vocabulary speech recognition is far from being solved: background noise, channel distortions, foreign accents, casual and disfluent speech, or unexpected topic changes can cause automated systems to make egregious recognition errors. This is because current LVCSR systems are not robust to mismatched training and test conditions and cannot handle context as well as human listeners, despite being trained on thousands of hours of speech and billions of words of text.
Keywords :
speaker recognition; vocabulary; background noise; call center application; casual speech; channel distortion; conversational telephone speech transcription; disfluent speech; foreign accent; large-vocabulary continuous speech recognition system; lectures transcription; legal speech recognition; medical speech recognition; meetings transcription; open-domain voice search; speaker dependent dictation system; speaker-independent automatic broadcast news indexing; speaker-independent automatic broadcast news transcription; Acoustics; Adaptation models; Automatic speech recognition; Hidden Markov models; Speech recognition; Vocabularies;