DocumentCode
21520
Title
State-Clustering Based Multiple Deep Neural Networks Modeling Approach for Speech Recognition
Author
Pan Zhou ; Hui Jiang ; Li-Rong Dai ; Yu Hu ; Qing-Feng Liu
Author_Institution
Nat. Eng. Lab. of Speech & Language Inf. Process., Univ. of Sci. & Technol. of China, Hefei, China
Volume
23
Issue
4
fYear
2015
fDate
Apr-15
Firstpage
631
Lastpage
642
Abstract
The hybrid deep neural network (DNN) and hidden Markov model (HMM) approach has recently achieved dramatic performance gains in automatic speech recognition (ASR). The DNN-based acoustic model is very powerful, but its learning process is extremely time-consuming. In this paper, we propose a novel DNN-based acoustic modeling framework for speech recognition, in which the posterior probabilities of HMM states are computed from multiple DNNs (mDNN) rather than a single large DNN, so that training can be parallelized for faster turnaround. In the proposed mDNN method, all tied HMM states are first grouped into several disjoint clusters using data-driven methods. Next, several hierarchically structured DNNs are trained separately and in parallel for these clusters on multiple computing units (e.g., GPUs). In decoding, the posterior probabilities of HMM states are calculated by combining the outputs of the multiple DNNs. We show that the mDNN training procedure under popular criteria, including both frame-level cross-entropy (CE) and sequence-level discriminative training, can be parallelized efficiently to yield significant speedup. The speedup is mainly attributed to the fact that the multiple DNNs are parallelized over multiple GPUs and that each DNN is smaller in size and trained on only a subset of the training data. We have evaluated the proposed mDNN method on a 64-hour Mandarin transcription task and the 320-hour Switchboard task. Compared to a conventional DNN of similar total size, a 4-cluster mDNN yields comparable recognition performance on Switchboard (only about 2% performance degradation) with a more than 7-fold speedup in CE training and a 2.9-fold speedup in sequence training when 4 GPUs are used.
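The decoding step described in the abstract, combining outputs from multiple DNNs into state posteriors, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes a two-level factorization in which a cluster-level classifier produces P(cluster | x) and each per-cluster DNN produces P(state | cluster, x), with random logits standing in for real network outputs.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy setup: 6 tied HMM states partitioned into 2 disjoint clusters
# (3 states per cluster), as produced by data-driven state clustering.
rng = np.random.default_rng(0)

# Stand-in network outputs for one acoustic frame x (random logits here).
cluster_post = softmax(rng.normal(size=2))       # P(cluster | x)
within_post = [softmax(rng.normal(size=3)),      # P(state | cluster 0, x)
               softmax(rng.normal(size=3))]      # P(state | cluster 1, x)

# Combined posterior over all tied states:
#   P(s | x) = P(c(s) | x) * P(s | c(s), x),  where c(s) is s's cluster.
state_post = np.concatenate(
    [cluster_post[c] * within_post[c] for c in range(2)])

# The combined scores form a valid distribution over all 6 states,
# since each within-cluster posterior sums to 1.
assert np.isclose(state_post.sum(), 1.0)
```

Because each per-cluster DNN only has to model its own subset of states, its output layer (and training data) is smaller, which is what makes the parallel training across GPUs pay off.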
Keywords
graphics processing units; hidden Markov models; neural nets; speech recognition; speech recognition equipment; ASR; DNN-based acoustic model; GPU; HMM; Mandarin transcription task; automatic speech recognition; frame-level cross-entropy; hidden Markov model; hybrid deep neural network; mDNN method; multiple deep neural networks modeling; sequence-level discriminative training; state-clustering; switchboard task; time 320 hour; time 64 hour; Acoustics; Computational modeling; Hidden Markov models; Speech; Speech recognition; Training; Training data; Cross entropy training; data partition; deep neural networks (DNN); model parallelism; multiple DNNs (mDNN); parallel training; sequence training; speech recognition; state clustering;
fLanguage
English
Journal_Title
IEEE/ACM Transactions on Audio, Speech, and Language Processing
Publisher
IEEE
ISSN
2329-9290
Type
jour
DOI
10.1109/TASLP.2015.2392944
Filename
7010902
Link To Document