Title :
Combining time- and frequency-domain convolution in convolutional neural network-based phone recognition
Author_Institution :
MTA-SZTE Research Group on Artificial Intelligence, University of Szeged, Szeged, Hungary
Abstract :
Convolutional neural networks have proved very successful in image recognition, thanks to their tolerance to small translations. They have recently been applied to speech recognition as well, using a spectral representation as input. However, in this case the translations along the two axes (time and frequency) should be handled quite differently. So far, most authors have focused on convolution along the frequency axis, which offers invariance to speaker and speaking style variations. Other researchers have developed a different network architecture that applies time-domain convolution in order to process a longer time-span of input in a hierarchical manner. These two approaches have different background motivations, and both offer significant gains over a standard fully connected network. Here we show that the two network architectures can be readily combined, and that the combination retains the advantages of both. With the combined model we report an error rate of 16.7% on the TIMIT phone recognition task, a new record on this dataset.
Keywords :
convolution; frequency-domain analysis; image recognition; neural nets; speaker recognition; telecommunication computing; time-domain analysis; TIMIT phone recognition task; convolutional neural network; error rate; frequency axis; frequency-domain convolution; speaker invariance; speaking style variations; spectral representation; speech recognition; time-domain convolution; Biological neural networks; Error analysis; Time-frequency analysis; Training; Deep neural network; TIMIT; rectified linear unit
Conference_Title :
2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Conference_Location :
Florence
DOI :
10.1109/ICASSP.2014.6853584