Improving hands-free speech recognition in a car through audio-visual voice activity detection

Author

Faubel, Friedrich ; Georges, Munir ; Kumatani, Kenichi ; Bruhn, Andrés ; Klakow, Dietrich

Author_Institution

Saarland Univ., Saarbrücken, Germany

fYear

2011

fDate

May 30 2011-June 1 2011

Firstpage

70

Lastpage

75

Abstract

In this work, we show how speech recognition performance in a noisy car environment can be improved by combining audio-visual voice activity detection (VAD) with microphone array processing techniques. This is accomplished by enhancing the multi-channel audio signal in the speaker localization step through per-channel power spectral subtraction, with noise estimates obtained from the non-speech segments identified by VAD. This noise reduction step improves the accuracy of the estimated speaker positions and thereby the quality of the beamformed signal in the subsequent array processing step. Audio-visual voice activity detection has the advantage of being more robust in acoustically demanding environments. This claim is substantiated through speech recognition experiments on the AVICAR corpus, where the proposed localization framework gave a word error rate (WER) of 7.1% in combination with delay-and-sum beamforming. This compares to a WER of 8.9% for speaker localization with audio-only VAD, 11.6% without VAD, and 15.6% for a single distant channel.
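
The per-channel power spectral subtraction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the array layout (frames by frequency bins), the mean-based noise estimate, and the spectral floor parameter `floor` are all assumptions made for illustration. The VAD decisions are taken as a given boolean mask; how that mask is produced (audio-only or audio-visual) is outside the sketch.

```python
import numpy as np


def subtract_noise_power(power, is_speech, floor=0.01):
    """Power spectral subtraction for one channel (illustrative sketch).

    power:     array of shape (n_frames, n_bins), short-time power spectrum.
    is_speech: boolean array of shape (n_frames,), True where VAD flagged speech.
    floor:     assumed spectral floor, as a fraction of the noisy power.
    """
    power = np.asarray(power, dtype=float)
    is_speech = np.asarray(is_speech, dtype=bool)

    # Noise estimate: average power over the frames VAD marked as non-speech.
    noise_frames = power[~is_speech]
    if noise_frames.shape[0] == 0:
        # No non-speech frames available, so nothing can be estimated.
        return power.copy()
    noise_estimate = noise_frames.mean(axis=0)

    # Subtract the estimate and clamp to the floor to avoid negative power.
    return np.maximum(power - noise_estimate, floor * power)


def enhance_channels(channel_powers, is_speech, floor=0.01):
    """Apply the subtraction independently to every microphone channel.

    channel_powers: array of shape (n_channels, n_frames, n_bins).
    """
    return np.stack(
        [subtract_noise_power(p, is_speech, floor) for p in channel_powers]
    )
```

In this sketch each channel gets its own noise estimate, which matches the "per channel" wording of the abstract. The enhanced spectra would then feed the speaker localization step, which is not shown here.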

Keywords

acoustic signal detection; audio-visual systems; microphone arrays; speech recognition; AVICAR corpus; audio-visual voice activity detection; delay-and-sum beamforming; hands-free speech recognition; microphone array processing; multichannel audio signal enhancement; noise reduction; noisy car environment; non-speech segments; power spectral subtraction; speaker localization; speaker positions; Feature extraction; Hidden Markov models; Mouth; Noise; Speech; Visualization; automatic speech recognition; time of arrival estimation

fLanguage

English

Publisher

IEEE

Conference_Titel

Hands-free Speech Communication and Microphone Arrays (HSCMA), 2011 Joint Workshop on

Conference_Location

Edinburgh

Print_ISBN

978-1-4577-0997-5

Type

conf

DOI

10.1109/HSCMA.2011.5942412

Filename

5942412