Regional Information Center for Science and Technology (RICeST) - Audio-visual feature integration based on piecewise linear transformation for noise robust automatic speech recognition

DocumentCode :

3132064

Title :

Audio-visual feature integration based on piecewise linear transformation for noise robust automatic speech recognition

Author :

Kashiwagi, Y. ; Suzuki, M. ; Minematsu, Nobuaki ; Hirose, Keikichi

Author_Institution :

Grad. Sch. of Inf. Sci. & Technol., Univ. of Tokyo, Tokyo, Japan

fYear :

2012

fDate :

2-5 Dec. 2012

Firstpage :

149

Lastpage :

152

Abstract :

Multimodal speech recognition is a promising approach to noise-robust automatic speech recognition (ASR) and is attracting the attention of many researchers. Multimodal ASR uses not only audio features, which are sensitive to background noise, but also non-audio features such as lip shapes to achieve noise robustness. Although various methods have been proposed to integrate audio-visual features, how best to integrate audio and visual features is still under discussion. The weights of the audio and visual features should be chosen according to the noise characteristics and level: in general, larger weights should go to visual features when the noise level is high, and vice versa. But how can this be controlled? In this paper, we propose a feature integration method based on piecewise linear transformation. In contrast to other feature integration methods, the proposed method can change the weights appropriately depending on the state of the observed noisy feature, which carries information on both the uttered phonemes and the environmental noise. Noisy speech recognition experiments are conducted following CENSREC-1-AV, and an average word error reduction rate of around 24% is achieved compared with a decision fusion method.
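The abstract does not give the transformation itself, so the following is only a minimal sketch of a generic SPLICE-style piecewise linear transformation (SPLICE appears among the record's keywords), not the authors' implementation. It assumes the observed audio-visual feature vector is softly assigned to regions modeled by a diagonal-covariance Gaussian mixture, and that each region has its own affine transform. All names (`region_posteriors`, `integrate`, the `regions` dictionaries) and all numeric values are hypothetical and chosen purely for illustration.

```python
import math


def gaussian_log_density(y, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector y."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (yi - m) ** 2 / v)
        for yi, m, v in zip(y, mean, var)
    )


def region_posteriors(y, regions):
    """Posterior p(k | y) of each region (mixture component) given y."""
    log_scores = [
        math.log(r["weight"]) + gaussian_log_density(y, r["mean"], r["var"])
        for r in regions
    ]
    top = max(log_scores)  # subtract the maximum for numerical stability
    scores = [math.exp(s - top) for s in log_scores]
    total = sum(scores)
    return [s / total for s in scores]


def integrate(y, regions):
    """Piecewise linear transform: sum over k of p(k | y) * (A_k y + b_k)."""
    posteriors = region_posteriors(y, regions)
    dim_out = len(regions[0]["b"])
    out = [0.0] * dim_out
    for p, r in zip(posteriors, regions):
        for i in range(dim_out):
            affine = sum(a * yj for a, yj in zip(r["A"][i], y)) + r["b"][i]
            out[i] += p * affine
    return out


# Hypothetical toy setup: y = [audio, visual], one output dimension.
# Region 0 stands in for a low-noise state and leans on the audio feature;
# region 1 stands in for a high-noise state and leans on the visual feature.
regions = [
    {"weight": 0.5, "mean": [0.0, 0.0], "var": [1.0, 1.0],
     "A": [[0.9, 0.1]], "b": [0.0]},
    {"weight": 0.5, "mean": [5.0, 0.0], "var": [1.0, 1.0],
     "A": [[0.2, 0.8]], "b": [0.0]},
]

low_noise_out = integrate([0.0, 1.0], regions)
high_noise_out = integrate([5.0, 1.0], regions)
```

In this toy setup the effective audio and visual weights follow the region posteriors, so they shift with the observed feature itself rather than being fixed in advance, which is the behavior the abstract attributes to the proposed method.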

Keywords :

feature extraction; integration; noise (working environment); signal denoising; speech recognition; CENSREC-1-AV; audio features; audio-visual feature integration; background noises; decision fusion method; environmental noise; multimodal ASR; multimodal speech recognition; noise robust ASR; noise robust automatic speech recognition; noisy speech recognition; nonaudio features; observed noisy feature; piecewise linear transformation; sensitive noises; visual features; word error reduction rate; Error analysis; Hidden Markov models; Noise; Noise measurement; Speech; Speech recognition; Visualization; Feature enhancement; Multimodal ASR; SPLICE; noise robustness;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Spoken Language Technology Workshop (SLT), 2012 IEEE

Conference_Location :

Miami, FL

Print_ISBN :

978-1-4673-5125-6

Electronic_ISBN :

978-1-4673-5124-9

Type :

conf

DOI :

10.1109/SLT.2012.6424213

Filename :

6424213

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3132064