• DocumentCode
    19900
  • Title
    Fast Adaptation of Deep Neural Network Based on Discriminant Codes for Speech Recognition
  • Author
    Shaofei Xue; Ossama Abdel-Hamid; Hui Jiang; Lirong Dai; Qingfeng Liu
  • Author_Institution
    Nat. Eng. Lab. of Speech & Language Inf. Process., Univ. of Sci. & Technol. of China, Hefei, China
  • Volume
    22
  • Issue
    12
  • fYear
    2014
  • fDate
    Dec. 2014
  • Firstpage
    1713
  • Lastpage
    1725
  • Abstract
    Fast adaptation of deep neural networks (DNN) is an important research topic in deep learning. In this paper, we propose a general adaptation scheme for DNNs based on discriminant condition codes, which are fed directly to various layers of a pre-trained DNN through a new set of connection weights. We also present several training methods to learn these connection weights from training data, as well as the corresponding adaptation methods to learn a new condition code from adaptation data for each new test condition. In this work, the fast adaptation scheme is applied to supervised speaker adaptation in speech recognition based on either the frame-level cross-entropy or the sequence-level maximum mutual information (MMI) training criterion. We propose three ways to apply this adaptation scheme based on so-called speaker codes: i) nonlinear feature normalization in feature space; ii) direct model adaptation of the DNN based on speaker codes; iii) joint speaker adaptive training with speaker codes. We evaluate the proposed adaptation methods on two standard speech recognition tasks, namely TIMIT phone recognition and large-vocabulary speech recognition on the Switchboard task. Experimental results show that all three methods are effective at adapting large DNN models using only a small amount of adaptation data. For example, the Switchboard results show that the proposed speaker-code-based adaptation methods achieve up to 8-10% relative error reduction using only a few dozen adaptation utterances per speaker. Finally, we achieve very good performance on Switchboard (12.1% WER) after speaker adaptation with the sequence training criterion, which is very close to the best performance reported on this task ("Deep convolutional neural networks for LVCSR," T. N. Sainath et al., Proc. IEEE ICASSP, 2013).
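    The core idea described in the abstract, feeding a learned condition/speaker code into a hidden layer of a pre-trained DNN through an extra set of connection weights, can be sketched as below. This is a minimal illustrative sketch, not the authors' implementation; the names `W`, `V`, `b`, `c` and the use of `tanh` and random initial values are assumptions for illustration.

    ```python
    import numpy as np

    def adapted_layer(x, c, W, b, V):
        """One hidden layer of a pre-trained DNN, augmented with a
        speaker code c fed in through extra connection weights V.
        W, b are the pre-trained (frozen) layer parameters; V is
        learned from training data and c from adaptation data.
        (Illustrative sketch of the scheme, not the paper's code.)"""
        return np.tanh(W @ x + V @ c + b)

    rng = np.random.default_rng(0)
    x = rng.normal(size=40)        # acoustic feature vector
    c = rng.normal(size=8)         # learned speaker code for this speaker
    W = rng.normal(size=(64, 40))  # pre-trained layer weights
    V = rng.normal(size=(64, 8))   # new code-to-layer connection weights
    b = np.zeros(64)

    h = adapted_layer(x, c, W, b, V)
    print(h.shape)  # (64,)
    ```

    At adaptation time only the small code vector `c` is re-estimated per speaker, which is why a few dozen utterances suffice.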
  • Keywords
    entropy; learning (artificial intelligence); neural nets; speaker recognition; speech coding; Switchboard task; TIMIT phone recognition; adaptation data; condition code; connection weight learning; deep learning; deep neural network fast adaptation; direct model adaptation; discriminant codes; error reduction; feature space; frame-level cross-entropy; general adaptation scheme; joint speaker adaptive training; large vocabulary speech recognition; nonlinear feature normalization; pre-trained DNN; sequence training criterion; sequence-level maximum mutual information training criterion; speaker-code-based adaptation methods; standard speech recognition tasks; supervised speaker adaptation; test condition; training data; Adaptation models; Artificial neural networks; Hidden Markov models; Speech recognition; Training; Vectors; Condition code; cross entropy (CE); deep neural network (DNN); fast adaptation; maximum mutual information (MMI); speaker code
  • fLanguage
    English
  • Journal_Title
    Audio, Speech, and Language Processing, IEEE/ACM Transactions on
  • Publisher
    ieee
  • ISSN
    2329-9290
  • Type
    jour
  • DOI
    10.1109/TASLP.2014.2346313
  • Filename
    6874531