• DocumentCode
    730775
  • Title
    Multi-task deep neural network acoustic models with model adaptation using discriminative speaker identity for whisper recognition
  • Author
    Jingjie Li; Ian McLoughlin; Cong Liu; Shaofei Xue; Si Wei
  • Author_Institution
    Nat. Eng. Lab. of Speech & Language Inf. Process., Univ. of Sci. & Technol. of China, Hefei, China
  • fYear
    2015
  • fDate
    19-24 April 2015
  • Firstpage
    4969
  • Lastpage
    4973
  • Abstract
    This paper presents a study on whispered large vocabulary continuous speech recognition (wLVCSR). wLVCSR provides the ability to use ASR equipment in public places without concern for disturbing others or leaking private information. However, the task of wLVCSR is much more challenging than normal LVCSR due to the absence of pitch, which not only causes the signal-to-noise ratio (SNR) of whispers to be much lower than that of normal speech but also leads to flatness and formant shifts in whisper spectra. Furthermore, the amount of whisper data available for training is much smaller than for normal speech. In this paper, multi-task deep neural network (DNN) acoustic models are deployed to solve these problems. Moreover, model adaptation is performed on the multi-task DNN to normalize speaker and environmental variability in whispers based on discriminative speaker identity information. On a Mandarin whisper dictation task with 55 hours of whisper data, the proposed speaker-independent (SI) multi-task DNN model achieves a 56.7% character error rate (CER) improvement over a baseline Gaussian Mixture Model (GMM) discriminatively trained using only the whisper data. In addition, the CER of the proposed model for normal speech reaches 15.2%, which is close to the performance of a state-of-the-art DNN trained with one thousand hours of speech data. From this baseline, the model-adapted DNN gains a further 10.9% CER reduction over the generic model.
  • Keywords
    acoustic signal processing; neural nets; speech recognition; vocabulary; DNN acoustic models; Mandarin whisper dictation task; baseline Gaussian Mixture Model; character error rate; discriminative speaker identity; model adaptation; multitask deep neural network acoustic models; vocabulary continuous whisper automatic recognition; whisper recognition; whisper spectra flatness; whisper spectra formant shifts; Acoustics; Adaptation models; Data models; Neural networks; Speech; Speech recognition; Training; Silent speech interface; Whisper recognition; model adaption; multi-task DNN; speaker code;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on
  • Conference_Location
    South Brisbane, QLD
  • Type
    conf
  • DOI
    10.1109/ICASSP.2015.7178916
  • Filename
    7178916