DocumentCode
730714
Title
Modeling long temporal contexts in convolutional neural network-based phone recognition
Author
Toth, Laszlo
Author_Institution
MTA-SZTE Res. Group on Artificial Intell., Univ. of Szeged, Szeged, Hungary
fYear
2015
fDate
19-24 April 2015
Firstpage
4575
Lastpage
4579
Abstract
The deep neural network component of current hybrid speech recognizers is trained on a context of consecutive feature vectors. We investigate whether the time span of this input can be extended by splitting it up and modeling it in smaller chunks. One method for this is to train a hierarchy of two networks; a less well-known alternative, the split temporal context (STC) method, models the left and right contexts of a frame separately. We evaluate both techniques within a convolutional neural network framework and find that the two approaches combine well. With the combined model we can expand the time span of our network to 69 frames, achieving a 7.5% relative error rate reduction compared to modeling this large context as one block. We report a phone error rate of 17.1% on the TIMIT core test set, which is among the best scores published.
Keywords
convolution; learning (artificial intelligence); neural nets; speech recognition; vectors; STC method; TIMIT core test set; consecutive feature vectors; convolutional neural network-based phone recognition; current hybrid speech recognizers; deep neural network component; long temporal context modeling; phone error rate; relative error rate reduction; split temporal context method; Context; Context modeling; Convolution; Error analysis; Hidden Markov models; Neural networks; Speech recognition; Deep neural network; TIMIT; convolutional neural network; maxout; split temporal context;
fLanguage
English
Publisher
IEEE
Conference_Titel
Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on
Conference_Location
South Brisbane, QLD
Type
conf
DOI
10.1109/ICASSP.2015.7178837
Filename
7178837