DocumentCode
177991
Title
Look who's talking: Detecting the dominant speaker in a cluttered scenario
Author
D'Arca, Eleonora ; Robertson, Neil M. ; Hopgood, James R.
Author_Institution
Joint Res. Inst. for Signal & Image Process., Heriot-Watt Univ. & Univ. of Edinburgh, Edinburgh, UK
fYear
2014
fDate
4-9 May 2014
Firstpage
1532
Lastpage
1536
Abstract
In this work we propose a novel method to automatically detect and localise the dominant speaker in an enclosed scenario by means of audio and video cues. The underlying idea is that gesturing implies speaking, so observing motion amounts to observing an audio signal. To the best of our knowledge, state-of-the-art algorithms focus on stationary-motion scenarios and close-up scenes where only one audio source exists, whereas we extend the method to larger fields of view and cluttered scenarios that include multiple non-stationary moving speakers. In such contexts, moving objects that are not correlated with the dominant audio may exist, and their motion may incorrectly drive the audio-video (AV) correlation estimation. This suggests that extra localisation data may be fused at decision level to avoid detecting false positives. In this work, we learn Mel-frequency cepstral coefficients (MFCCs) and correlate them with the optical flow. We also exploit the audio and video signals to estimate the position of the actual speaker, narrowing down the visual search space and hence reducing the probability of a wrong voice-to-pixel region association. We compare our work with an existing state-of-the-art algorithm and show on real datasets a 36% precision improvement in localising a moving dominant speaker through occlusions and speech interference.
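The audio-video correlation idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a plain Pearson correlation between a single audio feature track (for example one MFCC coefficient per video frame) and the mean optical-flow magnitude of each image region, and it assumes the audio and video features are already extracted and time-aligned. The function names `region_audio_correlation` and `pick_dominant_region`, and the optional `candidate_mask` standing in for the position estimate that narrows the search space, are hypothetical.

```python
import numpy as np

def region_audio_correlation(audio_feature, region_motion):
    """Pearson correlation between one audio track and each region's motion track.

    audio_feature: shape (T,), e.g. one MFCC coefficient per video frame (assumed).
    region_motion: shape (T, R), e.g. mean optical-flow magnitude in R regions (assumed).
    """
    a = np.asarray(audio_feature, dtype=float)
    m = np.asarray(region_motion, dtype=float)
    a = a - a.mean()              # centre the audio track
    m = m - m.mean(axis=0)        # centre each region's motion track
    denom = np.linalg.norm(a) * np.linalg.norm(m, axis=0)
    # Regions with no motion variance (or a flat audio track) get correlation 0.
    safe = np.where(denom > 0, denom, 1.0)
    return np.where(denom > 0, (a @ m) / safe, 0.0)

def pick_dominant_region(audio_feature, region_motion, candidate_mask=None):
    """Return the index of the best-correlated region and all correlations.

    candidate_mask: optional boolean array of shape (R,); regions marked False
    are excluded, mimicking a position estimate that narrows the search.
    """
    corr = region_audio_correlation(audio_feature, region_motion)
    scores = corr
    if candidate_mask is not None:
        scores = np.where(np.asarray(candidate_mask, dtype=bool), corr, -np.inf)
    return int(np.argmax(scores)), corr
```

Restricting the candidates with a mask reflects the abstract's point that an uncorrelated moving object can otherwise win the correlation by chance; how the mask would be derived from audio and video localisation is not specified here.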
Keywords
audio signal processing; audio-visual systems; cepstral analysis; correlation theory; decision making; image fusion; image sequences; interference suppression; motion estimation; natural scenes; object tracking; speaker recognition; teleconferencing; video signal processing; MFCC; audio cues; audio signal processing; audio source; audio-video correlation estimation; automatic dominant speaker detection; close-up scene; cluttered scenario; decision level; field of view; localisation data fusion; mel frequency cepstral coefficients; moving dominant speaker localisation; nonstationary moving speaker; occlusions; optical flow; position estimation; speech interference; stationary motion scenario; video cues; video signal processing; visual space; voice-to-pixel region association; Acceleration; Correlation; Mel frequency cepstral coefficient; Speech; Speech processing; Vectors; AV Tracking; Audio-Video Correlation; Multimodal tracking; Speaker Recognition; Speaker Tracking;
fLanguage
English
Publisher
IEEE
Conference_Titel
Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on
Conference_Location
Florence
Type
conf
DOI
10.1109/ICASSP.2014.6853854
Filename
6853854