DocumentCode
2053553
Title
Improving hands-free speech recognition in a car through audio-visual voice activity detection
Author
Faubel, Friedrich ; Georges, Munir ; Kumatani, Kenichi ; Bruhn, Andrés ; Klakow, Dietrich
Author_Institution
Saarland Univ., Saarbrücken, Germany
fYear
2011
fDate
May 30 2011-June 1 2011
Firstpage
70
Lastpage
75
Abstract
In this work, we show how speech recognition performance in a noisy car environment can be improved by combining audio-visual voice activity detection (VAD) with microphone array processing techniques. This is accomplished by enhancing the multi-channel audio signal in the speaker localization step through per-channel power spectral subtraction, whose noise estimates are obtained from the non-speech segments identified by VAD. This noise reduction step improves the accuracy of the estimated speaker positions and thereby the quality of the beamformed signal in the subsequent array processing step. Audio-visual voice activity detection has the advantage of being more robust in acoustically demanding environments. This claim is substantiated through speech recognition experiments on the AVICAR corpus, where the proposed localization framework gave a WER of 7.1% in combination with delay-and-sum beamforming. This compares to a WER of 8.9% for speaker localization with audio-only VAD, 11.6% without VAD, and 15.6% for a single distant channel.
Keywords
acoustic signal detection; audio-visual systems; microphone arrays; speech recognition; AVICAR corpus; audio-visual voice activity detection; delay-and-sum beamforming; hands-free speech recognition; microphone array processing; multichannel audio signal enhancement; noise reduction; noisy car environment; non-speech segments; power spectral subtraction; speaker localization; speaker positions; Feature extraction; Hidden Markov models; Mouth; Noise; Speech; Speech recognition; Visualization; automatic speech recognition; time of arrival estimation
fLanguage
English
Publisher
IEEE
Conference_Titel
Hands-free Speech Communication and Microphone Arrays (HSCMA), 2011 Joint Workshop on
Conference_Location
Edinburgh
Print_ISBN
978-1-4577-0997-5
Type
conf
DOI
10.1109/HSCMA.2011.5942412
Filename
5942412