Regional Information Center for Science and Technology - Comparison of MPEG-4 facial animation parameter groups with respect to audio-visual speech recognition performance

DocumentCode :

2381109

Title :

Comparison of MPEG-4 facial animation parameter groups with respect to audio-visual speech recognition performance

Author :

Aleksic, Petar S. ; Katsaggelos, Aggelos K.

Author_Institution :

Dept. of Electr. & Comput. Eng., Northwestern Univ., Evanston, IL, USA

Volume :

Year :

2005

Date :

11-14 Sept. 2005

Abstract :

In this paper, we describe an audio-visual automatic speech recognition (AV-ASR) system that utilizes facial animation parameters (FAPs), supported by the MPEG-4 standard, for the visual representation of speech. We describe the visual feature extraction algorithms used for extracting the FAPs that control outer- and inner-lip movement. Principal component analysis (PCA) is performed on both the inner- and outer-lip FAP vectors in order to reduce their dimensionality and decorrelate them. The PCA-based projection weights of the extracted FAP vectors are used as visual features. Multi-stream hidden Markov models (HMMs) and a late integration approach are used to integrate audio and visual information and train a continuous AV-ASR system. We compare the performance of the developed AV-ASR system utilizing outer- and inner-lip FAPs, individually and jointly. Experiments were performed for different dimensionalities of the visual features, at various SNRs (0-30 dB) with additive white Gaussian noise, on a relatively large vocabulary (approximately 1000 words) database. The proposed system reduces the word error rate (WER) by 20% to 23% relative to the audio-only ASR WER. Conclusions are drawn on the individual and combined effectiveness of the inner- and outer-lip FAPs, and on the trade-off between the dimensionality of the visual features and the amount of speechreading information they contain, together with its influence on AV-ASR performance.
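
As a reading aid, the two quantities the abstract relies on can be sketched in a few lines of Python: the relative WER reduction, and a weighted combination of per-stream log-likelihoods of the kind commonly used in multi-stream HMMs. This is only an illustrative sketch, not the authors' implementation. The function names, the stream weight, and all numeric inputs are assumptions made for the example; the abstract does not state the weighting scheme or any absolute WER values.

```python
def relative_wer_reduction(wer_audio: float, wer_av: float) -> float:
    """Relative WER reduction, in percent, of AV-ASR over audio-only ASR."""
    if wer_audio <= 0:
        raise ValueError("audio-only WER must be positive")
    return 100.0 * (wer_audio - wer_av) / wer_audio


def combine_stream_log_likelihoods(
    log_p_audio: float, log_p_visual: float, audio_weight: float
) -> float:
    """Weighted sum of audio and visual log-likelihoods.

    Assumption: the two stream weights are non-negative and sum to 1,
    a common convention for multi-stream HMMs (not taken from the abstract).
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * log_p_audio + (1.0 - audio_weight) * log_p_visual


if __name__ == "__main__":
    # Hypothetical WERs (percent), chosen only to illustrate the formula.
    print(relative_wer_reduction(50.0, 40.0))
    # Hypothetical log-likelihoods and stream weight.
    print(combine_stream_log_likelihoods(-10.0, -20.0, 0.7))
```

With these hypothetical inputs, a drop from 50% to 40% WER is a 20% relative reduction, which is how the 20% to 23% figure in the abstract should be read: relative to the audio-only WER, not as an absolute difference in percentage points.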

Keywords :

AWGN; computer animation; face recognition; hidden Markov models; image representation; principal component analysis; speech recognition; MPEG-4 facial animation parameter; additive white Gaussian noise; audio-visual speech recognition performance; facial animation parameters; multistream hidden Markov models; principal component analysis; speech representation; visual feature extraction algorithms; visual representation; word error rate; Automatic control; Automatic speech recognition; Decorrelation; Facial animation; Feature extraction; Hidden Markov models; MPEG 4 Standard; Principal component analysis; Speech recognition;

Language :

English

Publisher :

IEEE

Conference_Title :

Image Processing, 2005. ICIP 2005. IEEE International Conference on

Print_ISBN :

0-7803-9134-9

Type :

conf

DOI :

10.1109/ICIP.2005.1530438

Filename :

1530438

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2381109