Regional Information Center for Science and Technology - On combining DNN and GMM with unsupervised speaker adaptation for robust automatic speech recognition

DocumentCode :

177461

Title :

On combining DNN and GMM with unsupervised speaker adaptation for robust automatic speech recognition

Author :

Shilin Liu ; Khe Chai Sim

Author_Institution :

Sch. of Comput., Nat. Univ. of Singapore, Singapore, Singapore

fYear :

2014

fDate :

4-9 May 2014

Firstpage :

195

Lastpage :

199

Abstract :

Recently, the context-dependent Deep Neural Network (CD-DNN) has been found to significantly outperform the Gaussian Mixture Model (GMM) on various large vocabulary continuous speech recognition tasks. Unlike GMM parameters, DNN parameters have no meaningful interpretation, which makes it difficult to devise effective adaptation methods for DNNs. Furthermore, DNN parameter estimation is based on discriminative criteria, which are more sensitive to label errors and therefore less reliable for unsupervised adaptation. Many effective adaptation techniques that have been developed and proven to work well for GMM/HMM systems cannot be easily applied to DNNs. Therefore, this paper proposes a novel method of combining DNN and GMM using the Temporally Varying Weight Regression framework to take advantage of the superior performance of DNNs and the robust adaptability of GMMs. This paper addresses the issue of incorporating the high-dimensional CD-DNN posteriors into this framework without dramatically increasing the system complexity. Experimental results on a broadcast news large vocabulary transcription task show that the proposed GMM+DNN/HMM system achieved a significant performance gain over the baseline DNN/HMM system. With additional unsupervised speaker adaptation, the best GMM+DNN/HMM system obtained about 20% relative improvement over the DNN/HMM baseline.

Keywords :

Gaussian processes; neural nets; parameter estimation; regression analysis; speaker recognition; CD-DNN; DNN parameter estimation; GMM/HMM systems; Gaussian mixture model; context dependent deep neural network; continuous speech recognition; robust automatic speech recognition; temporally varying weight regression framework; unsupervised adaptation; unsupervised speaker adaptation; vocabulary transcription; Acoustics; Adaptation models; Context; Hidden Markov models; Speech; Speech recognition; Training; Deep Neural Network; Speaker Adaptation

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on

Conference_Location :

Florence

Type :

conf

DOI :

10.1109/ICASSP.2014.6853585

Filename :

6853585

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=177461