Modeling long temporal contexts in convolutional neural network-based phone recognition

Author

Toth, Laszlo

Author_Institution

MTA-SZTE Research Group on Artificial Intelligence, University of Szeged, Szeged, Hungary

fYear

2015

fDate

19-24 April 2015

Firstpage

4575

Lastpage

4579

Abstract

The deep neural network component of current hybrid speech recognizers is trained on a context of consecutive feature vectors. We investigate whether the time span of this input can be extended by splitting it up and modeling it in smaller chunks. One method is to train a hierarchy of two networks; the less well-known split temporal context (STC) method instead models the left and right contexts of a frame separately. We evaluate these techniques within a convolutional neural network framework and find that the two approaches combine well. With the combined model we can expand the time span of our network to 69 frames, achieving a 7.5% relative error rate reduction compared to modeling this large context as one block. We report a phone error rate of 17.1% on the TIMIT core test set, which is among the best scores published.
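The abstract does not say how the context is divided, so the following is only a minimal sketch of the split temporal context idea, not the paper's implementation. It assumes that the 69-frame window is centered on the current frame, that edge frames are repeated as padding at utterance boundaries, and that the left and right halves share the center frame. The function names `context_window` and `split_temporal_context` are hypothetical.

```python
def context_window(frames, t, span):
    """Return `span` frames centered on index t.

    Assumption: frames beyond the utterance boundaries are filled by
    repeating the first or last frame.
    """
    if span % 2 == 0:
        raise ValueError("span must be odd so the window has a center frame")
    half = span // 2
    last = len(frames) - 1
    return [frames[min(max(i, 0), last)] for i in range(t - half, t + half + 1)]


def split_temporal_context(window):
    """Split a centered window into left and right halves.

    Assumption: both halves include the center frame, so a 69-frame
    window yields two 35-frame halves.
    """
    center = len(window) // 2
    return window[:center + 1], window[center:]


# Toy "utterance": 100 frames, each a 1-dimensional feature vector
# that simply stores its own index.
frames = [[float(i)] for i in range(100)]

window = context_window(frames, t=50, span=69)
left, right = split_temporal_context(window)
```

In a setup of this kind, each half would be fed to its own model, and the outputs would be merged by a further network; the details of that merging are not given in the abstract.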

Keywords

convolution; learning (artificial intelligence); neural nets; speech recognition; vectors; STC method; TIMIT core test set; consecutive feature vectors; convolutional neural network-based phone recognition; current hybrid speech recognizers; deep neural network component; long temporal context modeling; phone error rate; relative error rate reduction; split temporal context method; Context; Context modeling; Convolution; Error analysis; Hidden Markov models; Neural networks; Speech recognition; Deep neural network; TIMIT; convolutional neural network; maxout; split temporal context

fLanguage

English

Publisher

IEEE

Conference_Titel

Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on

Conference_Location

South Brisbane, QLD

Type

conf

DOI

10.1109/ICASSP.2015.7178837

Filename

7178837