• DocumentCode
    653726
  • Title
    Lightly supervised acoustic model training for imprecisely and asynchronously transcribed speech
  • Author
    Mihajlik, Peter; Balog, Andras
  • Author_Institution
    THINKTech Res. Center, Vác, Hungary
  • fYear
    2013
  • fDate
    16-19 Oct. 2013
  • Firstpage
    1
  • Lastpage
    5
  • Abstract
    In a variety of speech recognition tasks, a large amount of approximate transcription is available for the audio material but is not directly applicable to acoustic model training. Whereas roughly time-synchronous closed captions or proper audiobook texts are already used in lightly supervised techniques, the utilization of more imperfect and, at the same time, completely unaligned transcriptions is not self-evident. In this paper, we describe our experiments aiming at automated transcription of Hungarian parliamentary speeches. Essentially, a lightly supervised cross-domain acoustic model adaptation/retraining is performed. A low-resource broadcast news model is used to bootstrap the process. Relying on automatic recognition of the parliamentary training speech and on data selection based on dynamic text alignment, a new, task-specific acoustic model is built. For the adaptation to the parliamentary domain, only edited official transcriptions and unaligned speech data are used, without any additional human annotation effort. The adapted acoustic model is applied to unseen target speech in real-time recognition. The word accuracy of the automatic transcription differs from that of the human-powered official transcription by only 5%, both measured against the exact reference text.
  • Keywords
    acoustic signal processing; audio signal processing; learning (artificial intelligence); natural language processing; speech recognition; text analysis; asynchronously transcribed speech; audio material; automated Hungarian parliamentary speech transcription; automatic parliamentary training speech recognition; dynamic text alignment based data selection; edited official transcriptions; imprecisely transcribed speech; lightly supervised across-domain acoustic model adaptation; lightly supervised across-domain acoustic model retraining; low-resource broadcast news model; process bootstrapping; real-time recognition; task-specific acoustic model; unaligned speech data; unseen target speech; word accuracy difference; Acoustics; Adaptation models; Data models; Filtering; Speech; Speech recognition; Training; acoustic modeling; cross-domain adaptation; large vocabulary continuous speech recognition; lightly supervised training
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Speech Technology and Human-Computer Dialogue (SpeD), 2013 7th Conference on
  • Conference_Location
    Cluj-Napoca
  • Type
    conf
  • DOI
    10.1109/SpeD.2013.6682653
  • Filename
    6682653