DocumentCode
730778
Title
An analysis of convolutional neural networks for speech recognition
Author
Jui-Ting Huang ; Jinyu Li ; Yifan Gong
Author_Institution
Microsoft Corp., Redmond, WA, USA
fYear
2015
fDate
19-24 April 2015
Firstpage
4989
Lastpage
4993
Abstract
Despite the fact that several sites have reported the effectiveness of convolutional neural networks (CNNs) on some tasks, there is no deep analysis of why CNNs perform well and in which cases we should see CNNs' advantage. In light of this, this paper aims to provide a detailed analysis of CNNs. By visualizing the localized filters learned in the convolutional layer, we show that edge detectors in varying directions can be automatically learned. We then identify four domains in which we think CNNs can consistently provide advantages over fully-connected deep neural networks (DNNs): channel-mismatched training-test conditions, noise robustness, distant speech recognition, and low-footprint models. For distant speech recognition, a CNN trained on 1000 hours of Kinect distant speech data obtains a relative 4% word error rate reduction (WERR) over a DNN of a similar size. To our knowledge, this is the largest corpus so far reported in the literature on which CNNs have shown their effectiveness. Lastly, we establish that the CNN structure combined with maxout units is the most effective model under small-sizing constraints for the purpose of deploying small-footprint models to devices. This setup gives a relative 9.3% WERR over DNNs with sigmoid units.
Keywords
convolution; edge detection; filters; neural nets; speech recognition; CNN; DNN; Kinect distant speech data; WERR; channel-mismatched training-test conditions; convolutional neural networks; deep neural networks; distant speech recognition; edge detectors; localized filters; low-footprint models; noise robustness; sigmoid units; time 1000 hour; word error rate reduction; Convolution; Feature extraction; Hidden Markov models; Neural networks; Speech; Speech recognition; Training data; Convolutional neural networks; low footprint models; maxout units
fLanguage
English
Publisher
IEEE
Conference_Titel
Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on
Conference_Location
South Brisbane, QLD
Type
conf
DOI
10.1109/ICASSP.2015.7178920
Filename
7178920