• DocumentCode
    1224405
  • Title

    Speech Enhancement and Recognition in Meetings With an Audio–Visual Sensor Array

  • Author

    Maganti, Hari Krishna ; Gatica-Perez, Daniel ; McCowan, Iain

  • Author_Institution
    Inst. of Neural Inf. Process., Univ. of Ulm, Ulm
  • Volume
    15
  • Issue
    8
  • fYear
    2007
  • Firstpage
    2257
  • Lastpage
    2269
  • Abstract
    This paper addresses the problem of distant speech acquisition in multiparty meetings using multiple microphones and cameras. Microphone array beamforming techniques present a potential alternative to close-talking microphones by providing speech enhancement through spatial filtering. Beamforming techniques, however, rely on knowledge of the speaker location. In this paper, we present an integrated approach in which an audio-visual multiperson tracker is used to track active speakers with high accuracy. Speech enhancement is then achieved using microphone array beamforming followed by a novel postfiltering stage. Finally, speech recognition is performed to evaluate the quality of the enhanced speech signal. The approach is evaluated on data recorded in a real meeting room for stationary-speaker, moving-speaker, and overlapping-speech scenarios. The results show that the speech enhancement and recognition performance achieved using our approach is significantly better than that of a single table-top microphone and is comparable to that of a lapel microphone for some of the scenarios. The results also indicate that the audio-visual system performs significantly better than an audio-only system in terms of both enhancement and recognition. This shows that the accurate speaker tracking provided by the audio-visual sensor array improves recognition performance in a microphone array-based speech recognition system.
  • Keywords
    array signal processing; audio signal processing; audio-visual systems; filtering theory; microphone arrays; speaker recognition; speech enhancement; tracking filters; audio-visual multiperson tracker; audio-visual sensor array; distant speech acquisition problem; microphone array beamforming techniques; multiparty meetings; postfiltering stage; speech recognition; Cameras; Filtering; Performance evaluation; Sensor arrays; Sensor systems; Speech analysis; Audio–visual fusion; microphone array processing; multiobject tracking
  • fLanguage
    English
  • Journal_Title
    Audio, Speech, and Language Processing, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1558-7916
  • Type

    jour

  • DOI
    10.1109/TASL.2007.906197
  • Filename
    4317572