Lightly supervised acoustic model training for imprecisely and asynchronously transcribed speech

Author

Mihajlik, Peter ; Balog, Andras

Author_Institution

THINKTech Res. Center, Vác, Hungary

fYear

2013

fDate

16-19 Oct. 2013

Firstpage

1

Lastpage

5

Abstract

In a variety of speech recognition tasks, a large amount of approximate transcription is available for the audio material, but it is not directly applicable to acoustic model training. Whereas roughly time-synchronous closed captions or proper audiobook texts are already used in lightly supervised techniques, the utilization of more imperfect and, at the same time, completely unaligned transcriptions is not self-evident. In this paper we describe our experiments aiming at the automated transcription of Hungarian parliamentary speeches. Essentially, a lightly supervised cross-domain acoustic model adaptation/retraining is performed. A low-resource broadcast news model is used to bootstrap the process. Relying on automatic recognition of the parliamentary training speech and on data selection based on dynamic text alignment, a new, task-specific acoustic model is built. For the adaptation to the parliamentary domain, only edited official transcriptions and unaligned speech data are used, without any additional human annotation effort. The adapted acoustic model is applied to unseen target speech in real-time recognition. The word accuracy difference between the automatic transcription and the human-powered official transcription is only 5% (both compared to the exact reference text).
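
The abstract does not say how the alignment-based data selection is implemented. As a rough illustration only, the idea of keeping stretches of speech where the recognizer output agrees with the edited official transcription might be sketched as follows. The function name `select_matching_segments`, the `min_words` threshold, and the use of Python's `difflib.SequenceMatcher` in place of the authors' dynamic text alignment are all assumptions made for this sketch, not details from the paper.

```python
from difflib import SequenceMatcher


def select_matching_segments(hypothesis, transcript, min_words=3):
    """Return runs of words on which the ASR hypothesis and transcript agree.

    hypothesis: list of words produced by the bootstrap recognizer.
    transcript: list of words from the edited, unaligned official text.
    min_words:  hypothetical threshold; shorter agreeing runs are discarded.

    Each result is (start, end, words), where start/end index the hypothesis.
    """
    # autojunk=False so that frequent words are not silently ignored.
    matcher = SequenceMatcher(a=hypothesis, b=transcript, autojunk=False)
    selected = []
    for block in matcher.get_matching_blocks():
        # The final block is a zero-length sentinel and is filtered out here.
        if block.size >= min_words:
            start, end = block.a, block.a + block.size
            selected.append((start, end, hypothesis[start:end]))
    return selected


if __name__ == "__main__":
    hyp = "the honourable member has the floor now please".split()
    ref = "honourable member you have the floor now".split()
    for start, end, words in select_matching_segments(hyp, ref):
        print(start, end, " ".join(words))
```

In a real system the selected word ranges would be mapped back to the recognizer's time stamps, so that only the corresponding audio is passed on to acoustic model retraining; that step is omitted here.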

Keywords

acoustic signal processing; audio signal processing; learning (artificial intelligence); natural language processing; speech recognition; text analysis; asynchronously transcribed speech; audio material; automated Hungarian parliamentary speech transcription; automatic parliamentary training speech recognition; dynamic text alignment based data selection; edited official transcriptions; imprecisely transcribed speech; lightly supervised across-domain acoustic model adaptation; lightly supervised across-domain acoustic model retraining; low-resource broadcast news model; process bootstrapping; real-time recognition; task-specific acoustic model; unaligned speech data; unseen target speech; word accuracy difference; Acoustics; Adaptation models; Data models; Filtering; Speech; Speech recognition; Training; acoustic modeling; cross-domain adaptation; large vocabulary continuous speech recognition; lightly supervised training;

fLanguage

English

Publisher

IEEE

Conference_Titel

Speech Technology and Human-Computer Dialogue (SpeD), 2013 7th Conference on

Conference_Location

Cluj-Napoca

Type

conf

DOI

10.1109/SpeD.2013.6682653

Filename

6682653