DocumentCode
21520
Title
State-Clustering Based Multiple Deep Neural Networks Modeling Approach for Speech Recognition
Author
Pan Zhou ; Hui Jiang ; Li-Rong Dai ; Yu Hu ; Qing-Feng Liu
Author_Institution
Nat. Eng. Lab. of Speech & Language Inf. Process., Univ. of Sci. & Technol. of China, Hefei, China
Volume
23
Issue
4
fYear
2015
fDate
Apr-15
Firstpage
631
Lastpage
642
Abstract
The hybrid deep neural network (DNN) and hidden Markov model (HMM) approach has recently achieved dramatic performance gains in automatic speech recognition (ASR). The DNN-based acoustic model is very powerful, but its learning process is extremely time-consuming. In this paper, we propose a novel DNN-based acoustic modeling framework for speech recognition, in which the posterior probabilities of HMM states are computed from multiple DNNs (mDNN) rather than a single large DNN, so that training can be parallelized for faster turnaround. In the proposed mDNN method, all tied HMM states are first grouped into several disjoint clusters using data-driven methods. Next, several hierarchically structured DNNs are trained separately and in parallel for these clusters on multiple computing units (e.g., GPUs). In decoding, the posterior probabilities of HMM states are calculated by combining the outputs of the multiple DNNs. We show that the mDNN training procedure under popular criteria, including both frame-level cross-entropy (CE) and sequence-level discriminative training, can be parallelized efficiently to yield significant speedup. The speedup is mainly attributed to the fact that the multiple DNNs are parallelized over multiple GPUs and that each DNN is smaller in size and trained on only a subset of the training data. We have evaluated the proposed mDNN method on a 64-hour Mandarin transcription task and the 320-hour Switchboard task. Compared to a conventional DNN of similar total size, a 4-cluster mDNN yields comparable recognition performance on Switchboard (only about 2% performance degradation) with a more than 7-fold speedup in CE training and a 2.9-fold speedup in sequence training when 4 GPUs are used.
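The decoding step described in the abstract, combining outputs from multiple DNNs into state posteriors, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes a two-level factorization in which a cluster-level classifier produces P(cluster | x) and each per-cluster DNN produces P(state | cluster, x), with random logits standing in for real network outputs.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy setup: 6 tied HMM states partitioned into 2 disjoint clusters
# (3 states per cluster), as produced by data-driven state clustering.
rng = np.random.default_rng(0)

# Stand-in network outputs for one acoustic frame x (random logits here).
cluster_post = softmax(rng.normal(size=2))       # P(cluster | x)
within_post = [softmax(rng.normal(size=3)),      # P(state | cluster 0, x)
               softmax(rng.normal(size=3))]      # P(state | cluster 1, x)

# Combined posterior over all tied states:
#   P(s | x) = P(c(s) | x) * P(s | c(s), x),  where c(s) is s's cluster.
state_post = np.concatenate(
    [cluster_post[c] * within_post[c] for c in range(2)])

# The combined scores form a valid distribution over all 6 states,
# since each within-cluster posterior sums to 1.
assert np.isclose(state_post.sum(), 1.0)
```

Because each per-cluster DNN only has to model its own subset of states, its output layer (and training data) is smaller, which is what makes the parallel training across GPUs pay off.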
Keywords
graphics processing units; hidden Markov models; neural nets; speech recognition; speech recognition equipment; ASR; DNN-based acoustic model; GPU; HMM; Mandarin transcription task; automatic speech recognition; frame-level cross-entropy; hidden Markov model; hybrid deep neural network; mDNN method; multiple deep neural networks modeling; sequence-level discriminative training; state-clustering; switchboard task; time 320 hour; time 64 hour; Acoustics; Computational modeling; Hidden Markov models; Speech; Speech recognition; Training; Training data; Cross entropy training; data partition; deep neural networks (DNN); model parallelism; multiple DNNs (mDNN); parallel training; sequence training; speech recognition; state clustering;
fLanguage
English
Journal_Title
IEEE/ACM Transactions on Audio, Speech, and Language Processing
Publisher
IEEE
ISSN
2329-9290
Type
jour
DOI
10.1109/TASLP.2015.2392944
Filename
7010902
Link To Document