Multipose audio-visual speech recognition

Author

Estellers, Virginia ; Thiran, Jean-Philippe

Author_Institution

Signal Process. Lab. LTS5, Ecole Polytech. Fed. de Lausanne (EPFL), Lausanne, Switzerland

fYear

2011

fDate

Aug. 29 2011-Sept. 2 2011

Firstpage

1065

Lastpage

1069

Abstract

In this paper, we study the adaptation of visual and audio-visual speech recognition systems to non-ideal visual conditions. We focus on the effects of a changing pose of the speaker relative to the camera, a problem encountered in natural situations. To this end, we introduce a pose normalization technique and perform speech recognition from multiple views by generating virtual frontal views from non-frontal images. The proposed method is inspired by pose-invariant face recognition studies and relies on linear regression to find an approximate mapping between images from different poses. Lipreading experiments quantify the loss of performance caused by pose changes and the extent to which the proposed pose normalization techniques recover it, while audio-visual results analyse how an audio-visual system should account for non-frontal poses in terms of the weight assigned to the visual modality in the audio-visual classifier.
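
The abstract names two mechanisms: a linear-regression mapping from non-frontal to frontal views, and a weight on the visual modality in the audio-visual classifier. The following is a minimal illustrative sketch of both ideas, not the authors' implementation. The function names, the ridge term, the feature dimensions, and the convex log-likelihood fusion rule are all assumptions made for illustration; the record does not specify them.

```python
import numpy as np


def fit_pose_mapping(X_side, X_front, ridge=1e-3):
    """Fit a linear map (with bias) from non-frontal to frontal features.

    X_side, X_front: arrays of shape (n_samples, d_in) and (n_samples, d_out)
    holding paired feature vectors of the same frames seen from two poses.
    The ridge term is an assumption added for numerical stability.
    """
    # Append a column of ones so the last row of W acts as a bias.
    X_aug = np.hstack([X_side, np.ones((X_side.shape[0], 1))])
    reg = ridge * np.eye(X_aug.shape[1])
    reg[-1, -1] = 0.0  # leave the bias term unpenalized
    # Solve the regularized normal equations (X^T X + reg) W = X^T Y.
    return np.linalg.solve(X_aug.T @ X_aug + reg, X_aug.T @ X_front)


def to_virtual_frontal(X_side, W):
    """Apply the fitted map to produce 'virtual frontal' features."""
    X_aug = np.hstack([X_side, np.ones((X_side.shape[0], 1))])
    return X_aug @ W


def fuse_log_likelihoods(log_audio, log_visual, visual_weight):
    """Combine per-class log-likelihoods with a weight on the visual stream.

    Hypothetical fusion rule: a convex combination, with visual_weight
    in [0, 1]. A lower weight would suit less reliable (non-frontal) video.
    """
    return (1.0 - visual_weight) * log_audio + visual_weight * log_visual
```

In such a sketch, the mapping would be fitted on paired frontal and non-frontal training frames, and the visual weight would be tuned per pose on held-out data.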

Keywords

audio-visual systems; face recognition; speech recognition; approximate mapping; audio-visual classifier; audio-visual system; multipose audio-visual speech recognition; pose normalization technique; pose normalization techniques; pose-invariant face recognition; visual modality; Discrete cosine transforms; Feature extraction; Mouth; Speech; Speech recognition; Visualization

fLanguage

English

Publisher

IEEE

Conference_Title

Signal Processing Conference, 2011 19th European

Conference_Location

Barcelona

ISSN

2076-1465

Type

conf

Filename

7073867