Title :
On Reducing Harmonic and Sampling Distortion in Vocal Tract Length Normalization
Author :
Yoma, N.B. ; Garreton, C. ; Huenupan, F. ; Catalan, I. ; Wuth Sepúlveda, J.
Author_Institution :
Dept. of Electr. Eng., Univ. de Chile, Santiago, Chile
Abstract :
This paper proposes a novel feature-space VTLN (vocal tract length normalization) method that models frequency warping as a linear interpolation of contiguous Mel filter-bank energies. The presented technique aims to reduce the distortion in the Mel filter-bank energy estimation due to the harmonic composition of voiced speech intervals and DFT (discrete Fourier transform) sampling when the central frequency of band-pass filters is shifted. This paper also proposes an analytical maximum likelihood (ML) method to estimate the optimal warping factor in the cepstral space. The presented interpolated filter-bank energy-based VTLN leads to relative reductions in WER (word error rate) as high as 11.2% and 7.6% when compared with the baseline system and standard VTLN, respectively, in a medium-vocabulary continuous speech recognition task. Also, the proposed VTLN scheme can provide significant reductions in WER when compared with state-of-the-art VTLN methods based on linear transforms in the cepstral feature-space. The warping factor estimated with the proposed VTLN approach shows more dependence on the speaker and more independence of the acoustic-phonetic content than the warping factor resulting from standard and state-of-the-art VTLN methods. Finally, the analytical ML-based optimization scheme presented here achieves almost the same reductions in WER as the ML grid search version of the technique with a computational load 20 times lower.
Keywords :
band-pass filters; discrete Fourier transforms; maximum likelihood estimation; speech processing; speech recognition; DFT; ML grid search; ML method; ML-based optimization scheme; Mel filter-bank energy estimation; WER; acoustic-phonetic content; analytical maximum likelihood; baseline system; central frequency; cepstral feature-space; cepstral space; contiguous Mel filter-bank energies; discrete Fourier transform sampling; feature-space VTLN; frequency warping modeling; harmonic composition; harmonic reduction; interpolated filter-bank energy-based VTLN; linear interpolation; linear transforms; medium-vocabulary continuous speech recognition task; optimal warping factor; sampling distortion reduction; standard VTLN; vocal tract length normalization method; voiced speech intervals; word error rate; Band pass filters; Harmonic analysis; Interpolation; Maximum likelihood detection; Nonlinear filters; Power harmonic filters; Speech; Speech analysis; vocal tract length normalization
Journal_Title :
Audio, Speech, and Language Processing, IEEE Transactions on
DOI :
10.1109/TASL.2012.2215590