HMM-based speech recognition using state-dependent, discriminatively derived transforms on mel-warped DFT features

Author

Chengalvarayan, Rathinavelu ; Deng, Li

Author_Institution

Dept. of Electr. & Comput. Eng., Waterloo Univ., Ont., Canada

Volume

5

Issue

3

fYear

1997

fDate

5/1/1997 12:00:00 AM

Firstpage

243

Lastpage

256

Abstract

In the study reported in this paper, we investigate the interactions between front-end feature extraction and back-end classification techniques in hidden Markov model-based (HMM-based) speech recognition. The proposed model focuses on reducing the dimensionality of the mel-warped discrete Fourier transform (DFT) feature space while preserving as much speech classification information as possible, and aims at finding an optimal linear transformation of the mel-warped DFT features according to the minimum classification error (MCE) criterion. This linear transformation, along with the HMM parameters, is automatically trained using the gradient descent method to minimize a measure of overall empirical error counts. A further generalization of the model allows the discriminatively derived state-dependent transformation to be integrated with the construction of dynamic feature parameters. Experimental results show that the state-dependent transformation of mel-warped DFT features outperforms mel-frequency cepstral coefficients (MFCCs). An error rate reduction of 15% is obtained on a standard 39-class TIMIT phone classification task, in comparison with the conventional MCE-trained HMM using MFCCs that have not been subject to optimization during training.
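The ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It rests on one well-known fact: MFCCs are obtained by applying a fixed discrete cosine transform (DCT) to log mel filter-bank energies, so the DCT is one particular linear transformation of the kind the paper makes trainable. Everything else here is an assumption made for illustration: the function names, the per-state dictionary, the DCT initialization, and the smoothing constants `eta` and `gamma` are not taken from the record.

```python
import math

def dct_matrix(n_out, n_in):
    """DCT-II basis rows: the fixed transform behind conventional MFCCs."""
    return [[math.cos(math.pi * k * (m + 0.5) / n_in) for m in range(n_in)]
            for k in range(n_out)]

def apply_transform(W, x):
    """Project an n_in-dimensional feature frame x down to n_out dimensions."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def mce_loss(scores, correct, eta=1.0, gamma=1.0):
    """Smoothed error count for one token (a common MCE formulation).

    scores:  per-class discriminant values (e.g. log-likelihoods), higher is better.
    correct: index of the true class.
    eta, gamma: smoothing constants (illustrative defaults, not from the paper).
    """
    rivals = [s for j, s in enumerate(scores) if j != correct]
    m = max(rivals)
    # Soft maximum over competing classes, computed stably as a log-mean-exp.
    anti = m + math.log(sum(math.exp(eta * (s - m)) for s in rivals) / len(rivals)) / eta
    d = anti - scores[correct]  # d > 0 means the token is misclassified
    return 1.0 / (1.0 + math.exp(-gamma * d))  # sigmoid gives a differentiable 0..1 error

# Hypothetical setup: one transform per HMM state, each starting from the DCT.
# Gradient descent on mce_loss would then update each state's matrix separately.
transforms = {state: dct_matrix(3, 8) for state in ("s1", "s2", "s3")}
frame = [0.5 * i for i in range(8)]  # stand-in for log mel filter-bank energies
reduced = apply_transform(transforms["s1"], frame)
```

Because the loss is a sigmoid of a score difference, it is differentiable with respect to both the transform entries and the model parameters, which is what makes joint gradient-descent training of the two possible.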

Keywords

discrete Fourier transforms; error analysis; feature extraction; hidden Markov models; minimisation; speech recognition; HMM-based speech recognition; back-end classification techniques; dimensionality reduction; discrete Fourier transform; empirical error counts; error rate reduction; front-end feature extraction; gradient descent method; hidden Markov model-based speech recognition; linear transformation; mel-frequency cepstral coefficients; mel-warped DFT features; minimum classification error; optimal linear transformation; optimization; speech classification information; standard 39-class TIMIT phone classification task; state-dependent discriminatively derived transforms; training; Cepstral analysis; Data mining; Discrete Fourier transforms; Discrete cosine transforms; Feature extraction; Filter bank; Hidden Markov models; Psychoacoustic models; Speech processing; Speech recognition

fLanguage

English

Journal_Title

Speech and Audio Processing, IEEE Transactions on

Publisher

IEEE

ISSN

1063-6676

Type

jour

DOI

10.1109/89.568731

Filename

568731