• DocumentCode
    2053553
  • Title
    Improving hands-free speech recognition in a car through audio-visual voice activity detection
  • Author
    Faubel, Friedrich ; Georges, Munir ; Kumatani, Kenichi ; Bruhn, Andrés ; Klakow, Dietrich
  • Author_Institution
    Saarland Univ., Saarbrücken, Germany
  • fYear
    2011
  • fDate
    May 30 2011-June 1 2011
  • Firstpage
    70
  • Lastpage
    75
  • Abstract
    In this work, we show how speech recognition performance in a noisy car environment can be improved by combining audio-visual voice activity detection (VAD) with microphone array processing techniques. This is accomplished by enhancing the multi-channel audio signal in the speaker localization step through per-channel power spectral subtraction, with noise estimates obtained from the non-speech segments identified by VAD. This noise reduction step improves the accuracy of the estimated speaker positions and thereby the quality of the beamformed signal in the subsequent array processing step. Audio-visual voice activity detection has the advantage of being more robust in acoustically demanding environments. This claim is substantiated through speech recognition experiments on the AVICAR corpus, where the proposed localization framework gave a word error rate (WER) of 7.1% in combination with delay-and-sum beamforming. This compares to a WER of 8.9% for speaker localization with audio-only VAD, 11.6% without VAD, and 15.6% for a single distant channel.
  • Keywords
    acoustic signal detection; audio-visual systems; microphone arrays; speech recognition; AVICAR corpus; audio-visual voice activity detection; delay-and-sum beamforming; hands-free speech recognition; microphone array processing; multichannel audio signal enhancement; noise reduction; noisy car environment; non-speech segments; power spectral subtraction; speaker localization; speaker positions; Feature extraction; Hidden Markov models; Mouth; Noise; Speech; Visualization; automatic speech recognition; time of arrival estimation
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Hands-free Speech Communication and Microphone Arrays (HSCMA), 2011 Joint Workshop on
  • Conference_Location
    Edinburgh
  • Print_ISBN
    978-1-4577-0997-5
  • Type
    conf
  • DOI
    10.1109/HSCMA.2011.5942412
  • Filename
    5942412
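
The noise reduction step described in the abstract (per-channel power spectral subtraction, with the noise estimate taken from frames that VAD marks as non-speech) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `spectral_subtract` and `enhance_channels`, the list-of-lists spectrogram layout, and the spectral `floor` parameter are all assumptions made for illustration, and the abstract does not say how the noise estimate is averaged or floored.

```python
def spectral_subtract(power_spec, speech_mask, floor=0.01):
    """Power spectral subtraction for ONE channel (illustrative sketch).

    power_spec  : list of frames, each a list of per-bin power values.
    speech_mask : list of bools, True where VAD marks the frame as speech.
    floor       : assumed spectral floor (fraction of the input power) that
                  keeps the result non-negative; not taken from the paper.
    """
    # Collect the frames that VAD labelled as non-speech.
    noise_frames = [f for f, is_speech in zip(power_spec, speech_mask)
                    if not is_speech]
    if not noise_frames:
        # No non-speech frames available: return the input unchanged.
        return [list(f) for f in power_spec]

    # Noise estimate: per-bin mean power over the non-speech frames
    # (a simple average, chosen here as an assumption).
    n_bins = len(power_spec[0])
    noise_est = [sum(f[k] for f in noise_frames) / len(noise_frames)
                 for k in range(n_bins)]

    # Subtract the estimate from every frame, clamping at the floor.
    return [[max(p - n, floor * p) for p, n in zip(frame, noise_est)]
            for frame in power_spec]


def enhance_channels(channels, speech_mask, floor=0.01):
    """Apply the subtraction independently to each microphone channel."""
    return [spectral_subtract(ch, speech_mask, floor) for ch in channels]
```

In this sketch the same VAD mask is shared across all channels, while each channel gets its own noise estimate. The enhanced spectra would then feed the speaker localization step, which is not shown here.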