Title :
Hierarchical discriminant features for audio-visual LVCSR
Author :
Potamianos, Gerasimos ; Luettin, Juergen ; Neti, Chalapathy
Author_Institution :
IBM Thomas J. Watson Res. Center, Yorktown Heights, NY, USA
Abstract :
We propose a hierarchical, two-stage discriminant transformation for obtaining audio-visual features that improve automatic speech recognition. In the first stage, linear discriminant analysis (LDA) followed by a maximum likelihood linear transform (MLLT) is applied separately to MFCC-based audio-only features and to visual-only features obtained by a discrete cosine transform of the video region of interest. Subsequently, a second stage of LDA and MLLT is applied to the concatenation of the resulting single-modality features. The resulting audio-visual features are used to train a traditional HMM-based speech recognizer. Experiments on the IBM ViaVoice™ audio-visual database demonstrate that the proposed feature fusion method improves speaker-independent, large-vocabulary continuous speech recognition (LVCSR) in both the clean and the noisy audio conditions considered. In the noisy case, it achieves a 24% relative word error rate reduction over an audio-only system.
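The two-stage structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it covers only the LDA steps (the MLLT step that follows each LDA is omitted for brevity), it runs on synthetic data, and the function names `lda_projection` and `hierarchical_fusion`, the feature dimensions, and the class labels are all assumptions chosen for illustration.

```python
import numpy as np

def lda_projection(X, y, out_dim):
    """Return a (d, out_dim) LDA projection for features X (n, d) and labels y (n,)."""
    d = X.shape[1]
    mean_all = X.mean(axis=0)
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Whiten the (lightly regularized) within-class scatter, then
    # diagonalize the between-class scatter in the whitened space.
    w_vals, w_vecs = np.linalg.eigh(Sw + 1e-6 * np.eye(d))
    whiten = w_vecs / np.sqrt(w_vals)
    b_vals, b_vecs = np.linalg.eigh(whiten.T @ Sb @ whiten)
    order = np.argsort(b_vals)[::-1][:out_dim]  # largest eigenvalues first
    return whiten @ b_vecs[:, order]

def hierarchical_fusion(audio, video, y, dim_a, dim_v, dim_av):
    """Stage 1: per-modality LDA. Stage 2: LDA on the concatenated projections."""
    P_a = lda_projection(audio, y, dim_a)
    P_v = lda_projection(video, y, dim_v)
    stage1 = np.hstack([audio @ P_a, video @ P_v])
    P_av = lda_projection(stage1, y, dim_av)
    return stage1 @ P_av

# Synthetic stand-ins for audio (e.g. MFCC-based) and visual (e.g. DCT-based)
# feature vectors; all dimensions here are hypothetical.
rng = np.random.default_rng(0)
n_per_class, n_classes = 100, 4
y = np.repeat(np.arange(n_classes), n_per_class)
audio = rng.normal(size=(len(y), 24)) + y[:, None] * 0.5
video = rng.normal(size=(len(y), 30)) + y[:, None] * 0.2
fused = hierarchical_fusion(audio, video, y, dim_a=3, dim_v=3, dim_av=3)
```

In this sketch, `fused` holds one low-dimensional audio-visual vector per input frame. In a recognizer of the kind the abstract describes, such vectors would serve as the HMM observations.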
Keywords :
audio signal processing; hidden Markov models; matrix algebra; maximum likelihood estimation; sensor fusion; speech recognition; statistical analysis; video signal processing; HMM based speech recognizer; IBM ViaVoice audio-visual database; audio-visual speaker-independent large vocabulary continuous speech recognition; automatic speech recognition; clean audio conditions; discrete cosine transform; feature fusion method; hierarchical discriminant features; hierarchical two-stage discriminant transformation; linear discriminant analysis; maximum likelihood linear transform; noisy audio conditions; single modality features; word error rate; Audio databases; Automatic speech recognition; Discrete cosine transforms; Discrete transforms; Hidden Markov models; Linear discriminant analysis; Mel frequency cepstral coefficient; Spatial databases; Speech recognition; Vocabulary;
Conference_Title :
Acoustics, Speech, and Signal Processing, 2001. Proceedings. (ICASSP '01). 2001 IEEE International Conference on
Conference_Location :
Salt Lake City, UT
Print_ISBN :
0-7803-7041-4
DOI :
10.1109/ICASSP.2001.940793