Regional Information Center for Science and Technology - Comparison of low- and high-level visual features for audio-visual continuous automatic speech recognition

DocumentCode :

3349409

Title :

Comparison of low- and high-level visual features for audio-visual continuous automatic speech recognition

Author :

Aleksic, Petar S. ; Katsaggelos, Aggelos K.

Author_Institution :

Dept. of Electr. & Comput. Eng., Northwestern Univ., Evanston, IL, USA

Volume :

Year :

2004

Date :

17-21 May 2004

Abstract :

We compare two groups of visual features, high-level and low-level, that can be used in addition to audio to improve automatic speech recognition (ASR). Facial animation parameters (FAPs), supported by the MPEG-4 standard for the visual representation of speech, are used as high-level visual features. Principal component analysis (PCA) based projection weights of the intensity images of the mouth area are used as low-level visual features. PCA is also applied to the FAPs. We develop an audio-visual ASR (AV-ASR) system and compare its performance for the two visual feature groups, following two approaches. The first approach assumes the same dimensionality for both high- and low-level visual features, while in the second approach the percentage of statistical variance described by the visual features is the same for both groups. Multi-stream hidden Markov models (HMMs) and a late integration approach are used to integrate audio and visual information and to perform continuous AV-ASR experiments. Experiments were performed at various SNRs (0-30 dB) with additive white Gaussian noise on a relatively large vocabulary database (approximately 1000 words). Conclusions are drawn on the trade-off between the dimensionality of the visual features and the amount of speechreading information they contain, and on its influence on AV-ASR performance.
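
The abstract reports experiments at SNRs from 0 to 30 dB with additive white Gaussian noise. As a minimal sketch of what that noise condition means, and not the authors' actual procedure, the snippet below scales zero-mean Gaussian noise so that the ratio of signal power to noise power matches a requested SNR. The function names `add_awgn` and `measured_snr_db`, the 440 Hz test tone standing in for speech, and the 16 kHz sampling rate are all illustrative assumptions that do not appear in the record.

```python
import math
import random


def add_awgn(signal, snr_db, rng=None):
    """Return a copy of `signal` with white Gaussian noise at the given SNR (dB).

    Hypothetical helper for illustration; not taken from the paper.
    """
    if rng is None:
        rng = random.Random()
    # Average power of the clean signal.
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB / 10)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Estimate the SNR (dB) actually obtained, given clean and noisy signals."""
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# Assumed stand-in for one second of audio: a 440 Hz tone sampled at 16 kHz.
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

for target in (0, 10, 20, 30):
    noisy = add_awgn(clean, target, rng)
    print(target, round(measured_snr_db(clean, noisy), 2))
```

The measured SNR only approximates the target, because the noise power is estimated from a finite number of samples; the gap shrinks as the signal gets longer.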

Keywords :

AWGN; audio-visual systems; hidden Markov models; principal component analysis; speech processing; speech recognition; video signal processing; AV-ASR; FAP; MPEG-4 standard; PCA; additive white Gaussian noise; audio-visual ASR; audio-visual automatic speech recognition; continuous automatic speech recognition; facial animation parameters; high-level visual features; late integration; low-level visual features; mouth area; multi-stream HMM; multi-stream hidden Markov models; statistical variance; visual representation; Additive white noise; Automatic speech recognition; Facial animation; Hidden Markov models; MPEG 4 Standard; Mouth; Principal component analysis; Spatial databases; Vocabulary;

Language :

English

Publisher :

IEEE

Conference_Title :

Acoustics, Speech, and Signal Processing, 2004. Proceedings. (ICASSP '04). IEEE International Conference on

ISSN :

1520-6149

Print_ISBN :

0-7803-8484-9

Type :

conf

DOI :

10.1109/ICASSP.2004.1327261

Filename :

1327261

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3349409