Title :
Multi-modal speech recognition using correlativity between modalities
Author :
Sato, Yuki ; Hamada, Nozomu
Author_Institution :
Dept. of Syst. Design Eng., Keio Univ., Yokohama, Japan
Abstract :
In recent years, to achieve speech recognition that is robust against noise, audio-visual speech recognition (AVSR) systems utilizing not only audio but also visual information from the lips have been studied. This paper proposes a method for determining the weight, called the stream exponent, that represents the reliability ratio of the audio and visual features. The method focuses on the correlation between the audio and visual modalities to estimate the optimal stream exponent. Furthermore, we modify the stream exponent using the periodicity of speech, such as pitch, to handle abrupt noise. An audio-visual database comprising a specific speaker's lip image sequences and audio recordings was constructed; the utterances consist of Japanese counting numbers and sound-alike words. Using this database, we built the AVSR system and performed an evaluation experiment. The obtained results verify the effectiveness of the proposed method under a variety of noisy environments.
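Note :
The sketch below illustrates the stream-exponent idea the abstract describes: per-stream log-likelihoods are combined with weights summing to one, a standard multi-stream formulation in AVSR. The abstract does not give the authors' exact estimator, so the correlation-based weight and the pitch-based attenuation below are illustrative assumptions, and all function names are hypothetical.

import numpy as np

def stream_exponent_from_correlation(audio_feats, visual_feats):
    """Assumed estimator: set the audio weight lambda_a in [0, 1] to the
    mean absolute correlation between synchronized audio and visual
    feature tracks, on the premise that low audio-visual correlation
    signals corrupted audio. Inputs: arrays of shape (frames, dims),
    frame-synchronized."""
    corrs = []
    for a_dim in audio_feats.T:
        for v_dim in visual_feats.T:
            c = np.corrcoef(a_dim, v_dim)[0, 1]
            if not np.isnan(c):  # skip constant (zero-variance) tracks
                corrs.append(abs(c))
    return float(np.mean(corrs)) if corrs else 0.5

def attenuate_on_aperiodicity(lambda_a, voiced):
    """Assumed pitch-based modification: if no speech periodicity is
    detected where voicing is expected, treat it as abrupt noise and
    shrink the audio weight."""
    return lambda_a if voiced else 0.5 * lambda_a

def combined_log_likelihood(log_p_audio, log_p_visual, lambda_a):
    """Standard multi-stream combination: stream exponents weight the
    per-stream log-likelihoods, with lambda_a + lambda_v = 1."""
    return lambda_a * log_p_audio + (1.0 - lambda_a) * log_p_visual

# Usage with synthetic features (13-dim audio, 6-dim visual, 120 frames):
rng = np.random.default_rng(0)
audio = rng.standard_normal((120, 13))
visual = rng.standard_normal((120, 6))
lam = stream_exponent_from_correlation(audio, visual)
lam = attenuate_on_aperiodicity(lam, voiced=True)
score = combined_log_likelihood(-42.0, -37.5, lam)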
Keywords :
audio-visual systems; image sequences; speech recognition; AVSR; audio modality; audio sequences; audio-visual system; visual modality; computational modeling; noise; audio-visual speech recognition; real environment noise; stream exponent;
Conference_Titel :
Intelligent Signal Processing and Communication Systems (ISPACS), 2010 International Symposium on
Conference_Location :
Chengdu
Print_ISBN :
978-1-4244-7369-4
DOI :
10.1109/ISPACS.2010.5704657