DocumentCode :
1686104
Title :
A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion
Author :
Li Deng ; Ossama Abdel-Hamid ; Dong Yu
Author_Institution :
Microsoft Research, Redmond, WA, USA
fYear :
2013
Firstpage :
6669
Lastpage :
6673
Abstract :
We develop and present a novel deep convolutional neural network architecture, where heterogeneous pooling is used to provide constrained frequency-shift invariance in the speech spectrogram while minimizing speech-class confusion induced by such invariance. The design of the pooling layer is guided by domain knowledge about how speech classes would change when formant frequencies are modified. The convolution and heterogeneous-pooling layers are followed by a fully connected multi-layer neural network to form a deep architecture interfaced to an HMM for continuous speech recognition. During training, all layers of this entire deep net are regularized using a variant of the “dropout” technique. Experimental evaluation demonstrates the effectiveness of both heterogeneous pooling and dropout regularization. On the TIMIT phonetic recognition task, we have achieved an 18.7% phone error rate, lowest on this standard task reported in the literature with a single system and with no use of information about speaker identity. Preliminary experiments on large vocabulary speech recognition in a voice search task also show error rate reduction using heterogeneous pooling in the deep convolutional neural network.
Keywords :
acoustic signal processing; hidden Markov models; minimisation; multilayer perceptrons; neural net architecture; speech recognition; HMM; TIMIT phonetic recognition; acoustic invariance; constrained frequency-shift invariance; continuous speech recognition; deep convolutional neural network architecture; dropout regularization; error rate reduction; formant frequency modification; heterogeneous pooling layers; large vocabulary speech recognition; multilayer neural network; phone error rate; phonetic confusion; speech spectrogram; speech-class confusion minimization; voice search task; Abstracts; Acoustics; Computer architecture; Image recognition; Indexes; Speech; Speech recognition; convolution; deep; discrimination; formants; heterogeneous pooling; invariance; neural network;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on
Conference_Location :
Vancouver, BC
ISSN :
1520-6149
Type :
conf
DOI :
10.1109/ICASSP.2013.6638952
Filename :
6638952