DocumentCode
1484009
Title
Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition
Author
Dahl, George E. ; Yu, Dong ; Deng, Li ; Acero, Alex
Author_Institution
Dept. of Comput. Sci., Univ. of Toronto, Toronto, ON, Canada
Volume
20
Issue
1
fYear
2012
Firstpage
30
Lastpage
42
Abstract
We propose a novel context-dependent (CD) model for large-vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output. The deep belief network pre-training algorithm is a robust and often helpful way to initialize deep neural networks generatively that can aid in optimization and reduce generalization error. We illustrate the key components of our model, describe the procedure for applying CD-DNN-HMMs to LVSR, and analyze the effects of various modeling choices on performance. Experiments on a challenging business search dataset demonstrate that CD-DNN-HMMs can significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs, with an absolute sentence accuracy improvement of 5.8% and 9.2% (or relative error reduction of 16.0% and 23.2%) over the CD-GMM-HMMs trained using the minimum phone error rate (MPE) and maximum-likelihood (ML) criteria, respectively.
Keywords
Gaussian processes; hidden Markov models; maximum likelihood estimation; neural nets; speech recognition; DNN-HMM; GMM; LVSR; MPE; context-dependent Gaussian mixture model; context-dependent pretrained deep neural network; deep belief network pretraining algorithm; hidden Markov model; large-vocabulary speech recognition; maximum-likelihood criteria; minimum phone error rate; relative error reduction; Acoustics; Artificial neural networks; Context modeling; Hidden Markov models; Mathematical model; Speech recognition; Training; Artificial neural network–hidden Markov model (ANN-HMM); context-dependent phone; deep belief network; deep neural network hidden Markov model (DNN-HMM); large-vocabulary speech recognition (LVSR); speech recognition
fLanguage
English
Journal_Title
Audio, Speech, and Language Processing, IEEE Transactions on
Publisher
IEEE
ISSN
1558-7916
Type
jour
DOI
10.1109/TASL.2011.2134090
Filename
5740583