• DocumentCode
    730778
  • Title
    An analysis of convolutional neural networks for speech recognition
  • Author
    Jui-Ting Huang ; Jinyu Li ; Yifan Gong
  • Author_Institution
    Microsoft Corp., Redmond, WA, USA
  • fYear
    2015
  • fDate
    19-24 April 2015
  • Firstpage
    4989
  • Lastpage
    4993
  • Abstract
    Despite the fact that several sites have reported the effectiveness of convolutional neural networks (CNNs) on some tasks, there is no deep analysis regarding why CNNs perform well and in which case we should see CNNs' advantage. In the light of this, this paper aims to provide some detailed analysis of CNNs. By visualizing the localized filters learned in the convolutional layer, we show that edge detectors in varying directions can be automatically learned. We then identify four domains we think CNNs can consistently provide advantages over fully-connected deep neural networks (DNNs): channel-mismatched training-test conditions, noise robustness, distant speech recognition, and low-footprint models. For distant speech recognition, a CNN trained on 1000 hours of Kinect distant speech data obtains relative 4% word error rate reduction (WERR) over a DNN of a similar size. To our knowledge, this is the largest corpus so far reported in the literature for CNNs to show its effectiveness. Lastly, we establish that the CNN structure combined with maxout units is the most effective model under small-sizing constraints for the purpose of deploying small-footprint models to devices. This setup gives relative 9.3% WERR from DNNs with sigmoid units.
  • Keywords
    convolution; edge detection; filters; neural nets; speech recognition; CNN; DNN; Kinect distant speech data; WERR; channel-mismatched training-test conditions; convolutional neural networks; deep neural networks; distant speech recognition; edge detectors; localized filters; low-footprint models; noise robustness; sigmoid units; time 1000 hour; word error rate reduction; Convolution; Feature extraction; Hidden Markov models; Neural networks; Speech; Speech recognition; Training data; Convolutional neural networks; DNN; low footprint models; maxout units;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on
  • Conference_Location
    South Brisbane, QLD
  • Type
    conf
  • DOI
    10.1109/ICASSP.2015.7178920
  • Filename
    7178920