Fusion of audio and video data by neural networks for robust vowel recognition

Author

Kroschel, K. ; Mekhaiel, M.S. ; Berthommier, F.

Author_Institution

Univ. Karlsruhe (Tech. Hochschule), Karlsruhe, Germany

fYear

1999

fDate

Aug. 31 1999-Sept. 3 1999

Firstpage

3029

Lastpage

3034

Abstract

The performance of speech recognition systems decreases dramatically in noisy environments. A robust human-computer interaction system should therefore make use of both acoustic and visual signals. In this paper we present an automatic vowel recognition system that can operate in quasi real time. We focus mainly on the recognition of five German vowels (a, e, i, o, u) and their corresponding visemes (images). First, the position of the continuously moving face is determined. The speech parameters of the spoken vowel, along with the model parameters of the lip image, are then fed to a neural network to recognize the uttered vowel. The face tracking is shape-independent, so no special requirements are imposed on the color or shape of the face.

Keywords

audio signal processing; face recognition; human computer interaction; natural language processing; neural nets; speech recognition; video signal processing; German vowel recognition; German vowel visemes; audio data fusion; automatic vowel recognition system; continuously moving face; face color; face shape; lip image; neural networks; robust human-computer interaction system; robust vowel recognition; speech parameters; speech recognition systems; video data fusion; data fusion; lipreading; vowel recognition

fLanguage

English

Publisher

IEEE

Conference_Titel

1999 European Control Conference (ECC)

Conference_Location

Karlsruhe

Print_ISBN

978-3-9524173-5-5

Type

conf

Filename

7099790