  • DocumentCode
    177461
  • Title
    On combining DNN and GMM with unsupervised speaker adaptation for robust automatic speech recognition
  • Author
    Shilin Liu; Khe Chai Sim
  • Author_Institution
    School of Computing, National University of Singapore, Singapore
  • fYear
    2014
  • fDate
    4-9 May 2014
  • Firstpage
    195
  • Lastpage
    199
  • Abstract
    Recently, the context-dependent Deep Neural Network (CD-DNN) has been found to significantly outperform the Gaussian Mixture Model (GMM) on various large vocabulary continuous speech recognition tasks. Unlike GMM parameters, DNN parameters have no meaningful interpretation, which makes it difficult to devise effective adaptation methods for DNNs. Furthermore, DNN parameter estimation is based on discriminative criteria, which are more sensitive to label errors and therefore less reliable for unsupervised adaptation. Many effective adaptation techniques that have been developed and proven to work well for GMM/HMM systems cannot be easily applied to DNNs. Therefore, this paper proposes a novel method of combining DNN and GMM using the Temporally Varying Weight Regression framework to take advantage of the superior performance of DNNs and the robust adaptability of GMMs. This paper addresses the issue of incorporating the high-dimensional CD-DNN posteriors into this framework without dramatically increasing the system complexity. Experimental results on a broadcast news large vocabulary transcription task show that the proposed GMM+DNN/HMM system achieved a significant performance gain over the baseline DNN/HMM system. With additional unsupervised speaker adaptation, the best GMM+DNN/HMM system obtained a relative improvement of about 20% over the DNN/HMM baseline.
  • Keywords
    Gaussian processes; neural nets; parameter estimation; regression analysis; speaker recognition; CD-DNN; DNN parameter estimation; GMM/HMM systems; Gaussian mixture model; context dependent deep neural network; continuous speech recognition; robust automatic speech recognition; temporally varying weight regression framework; unsupervised adaptation; unsupervised speaker adaptation; vocabulary transcription; Acoustics; Adaptation models; Context; Hidden Markov models; Speech; Speech recognition; Training; Deep Neural Network; Speaker Adaptation
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on
  • Conference_Location
    Florence
  • Type
    conf
  • DOI
    10.1109/ICASSP.2014.6853585
  • Filename
    6853585