Title :
Audio-visual feature integration based on piecewise linear transformation for noise robust automatic speech recognition
Author :
Kashiwagi, Y. ; Suzuki, M. ; Minematsu, Nobuaki ; Hirose, Keikichi
Author_Institution :
Grad. Sch. of Inf. Sci. & Technol., Univ. of Tokyo, Tokyo, Japan
Abstract :
Multimodal speech recognition is a promising approach to noise-robust automatic speech recognition (ASR) and is currently attracting the attention of many researchers. Multimodal ASR uses not only audio features, which are sensitive to background noise, but also non-audio features such as lip shapes to achieve noise robustness. Although various methods have been proposed to integrate audio-visual features, how best to integrate audio and visual features is still under discussion. The weights of audio and visual features should be decided according to the noise characteristics and levels: in general, larger weights should be given to visual features when the noise level is high, and vice versa. But how can this be controlled? In this paper, we propose a feature integration method based on piecewise linear transformation. In contrast to other feature integration methods, our proposed method can appropriately change the weights depending on the state of an observed noisy feature, which carries information on both the uttered phonemes and the environmental noise. Noisy speech recognition experiments are conducted following CENSREC-1-AV, and an average word error reduction rate of around 24% is achieved compared with a decision fusion method.
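The abstract does not give the transformation itself, so the following is only a minimal sketch of a generic SPLICE-style piecewise linear transformation applied to a concatenated audio-visual vector, not the authors' implementation. It assumes a diagonal-covariance Gaussian mixture over the noisy joint feature, with one affine transform per mixture component; the output is the posterior-weighted sum of the per-component affine outputs. All function names, parameters, and the toy values are hypothetical.

```python
import math

def log_gaussian_diag(y, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at y."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (yi - m) ** 2 / v)
        for yi, m, v in zip(y, mean, var)
    )

def region_posteriors(y, weights, means, variances):
    """Posterior p(k | y) of each mixture component (region) given y."""
    log_joint = [
        math.log(w) + log_gaussian_diag(y, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(log_joint)  # subtract the max for numerical stability
    unnorm = [math.exp(lj - peak) for lj in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def affine(A, b, y):
    """Compute A y + b, with A given as a list of rows."""
    return [sum(a * yi for a, yi in zip(row, y)) + bi for row, bi in zip(A, b)]

def piecewise_linear_integrate(audio, visual, gmm, transforms):
    """Posterior-weighted sum of per-region affine transforms (assumed form).

    gmm        -- (weights, means, variances) over the concatenated feature
    transforms -- one (A, b) pair per mixture component
    """
    y = list(audio) + list(visual)
    post = region_posteriors(y, *gmm)
    x_hat = [0.0] * len(transforms[0][1])
    for p, (A, b) in zip(post, transforms):
        for i, value in enumerate(affine(A, b, y)):
            x_hat[i] += p * value
    return x_hat, post

# Toy setup (hypothetical values): 1-D audio, 1-D visual, two regions.
# Region 0 models "clean-like" audio and passes the audio value through;
# region 1 models "noisy-like" audio and passes the visual value through.
toy_gmm = ([0.5, 0.5], [[0.0, 0.0], [5.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]])
toy_transforms = [([[1.0, 0.0]], [0.0]), ([[0.0, 1.0]], [0.0])]

clean_out, clean_post = piecewise_linear_integrate([0.0], [2.0], toy_gmm, toy_transforms)
noisy_out, noisy_post = piecewise_linear_integrate([5.0], [2.0], toy_gmm, toy_transforms)
```

In this toy setup the region posteriors act as soft, input-dependent stream weights: when the audio component looks clean, the output follows the audio value, and when it looks noisy, the output follows the visual value. In practice the mixture and the per-region transforms would be estimated from training data rather than set by hand.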
Keywords :
feature extraction; integration; noise (working environment); signal denoising; speech recognition; CENSREC-1-AV; audio features; audio-visual feature integration; background noises; decision fusion method; environmental noise; multimodal ASR; multimodal speech recognition; noise robust ASR; noise robust automatic speech recognition; noisy speech recognition; nonaudio features; observed noisy feature; piecewise linear transformation; sensitive noises; visual features; word error reduction rate; Error analysis; Hidden Markov models; Noise; Noise measurement; Speech; Speech recognition; Visualization; Feature enhancement; Multimodal ASR; SPLICE; noise robustness;
Conference_Title :
Spoken Language Technology Workshop (SLT), 2012 IEEE
Conference_Location :
Miami, FL
Print_ISBN :
978-1-4673-5125-6
Electronic_ISBN :
978-1-4673-5124-9
DOI :
10.1109/SLT.2012.6424213