Feature combination and stacking of recurrent and non-recurrent neural networks for LVCSR

Author

Plahl, Christian ; Kozielski, Michal ; Schlüter, Ralf ; Ney, Hermann

Author_Institution

Comput. Sci. Dept., RWTH Aachen Univ., Aachen, Germany

fYear

2013

Firstpage

6714

Lastpage

6718

Abstract

This paper investigates the combination of different short-term features and the combination of recurrent and non-recurrent neural networks (NNs) on a Spanish speech recognition task. Several methods exist to combine different feature sets, such as concatenation or linear discriminant analysis (LDA). Even though all these techniques achieve reasonable improvements, feature combination by multi-layer perceptrons (MLPs) outperforms all known approaches. We develop the concept of MLP-based feature combination further using recurrent neural networks (RNNs). The phoneme posterior estimates derived from an RNN lead to a significant improvement over the result of the MLPs and achieve a 5% relative improvement in word error rate (WER) with far fewer parameters. Moreover, we improve the system performance further by combining an MLP and an RNN in a hierarchical framework, in which the MLP benefits from the preprocessing of the RNN. All NNs are trained on phonemes. Nevertheless, the same concepts could be applied using context-dependent states. In addition to the improvements in recognition performance w.r.t. WER, NN-based feature combination methods reduce both the training and the testing complexity. Overall, the systems are based on a single set of acoustic models, together with the training of different NNs.

Keywords

acoustic signal processing; error statistics; multilayer perceptrons; natural language processing; recurrent neural nets; speech recognition; LVCSR; MLP based feature combination; RNN; Spanish speech recognition task; WER; acoustic models; context-dependent states; phoneme posterior estimates; recognition performance; recurrent neural networks; system performance; testing complexity; word error rate; Acoustics; Artificial neural networks; Hidden Markov models; Speech; Training; feature combination; long-short-term-memory; multi-layer perceptron

fLanguage

English

Publisher

IEEE

Conference_Titel

Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on

Conference_Location

Vancouver, BC

ISSN

1520-6149

Type

conf

DOI

10.1109/ICASSP.2013.6638961

Filename

6638961