Title :
Multimodal (audiovisual) source separation exploiting multi-speaker tracking, robust beamforming and time-frequency masking
Author :
Mohsen Naqvi, Syed ; Wang, W. ; Khan, M.S. ; Barnard, Mark ; Chambers, Jonathon A.
Author_Institution :
Adv. Signal Process. Group, Loughborough Univ., Loughborough, UK
fDate :
7/1/2012 12:00:00 AM
Abstract :
A novel multimodal source separation approach is proposed for physically moving and stationary sources. It exploits a circular microphone array, multiple video cameras, robust spatial beamforming and time-frequency masking. Moving sources are hard to separate because the mixing filters are time varying. The same difficulty arises for physically stationary sources when the reverberation time (RT) is high. The unmixing filters should therefore also be time varying, but these are difficult to determine from audio measurements alone. The proposed approach therefore uses the visual modality to facilitate separation of both stationary and moving sources. The movement of the sources is detected by a three-dimensional tracker based on a Markov Chain Monte Carlo particle filter. The audio separation is performed by a robust least-squares frequency-invariant data-independent beamformer. The 3D video-based tracker supplies the source location and direction-of-arrival information, and uncertainties in this information are controlled by a convex optimisation approach in the beamformer design. In the final stage, the separated audio sources are further enhanced by applying a binary time-frequency masking technique in the cepstral domain. Experimental results show that, by using the visual modality, the proposed algorithm can not only achieve better performance than conventional frequency-domain source separation algorithms but also provide acceptable separation performance for moving sources.
Keywords :
Markov processes; Monte Carlo methods; array signal processing; particle filtering (numerical methods); source separation; Markov Chain Monte Carlo particle filter; arrival information; audio measurements; audio-visual system; circular microphone array; mixing filters; moving sources; multimodal source separation; multiple video cameras; multispeaker tracking; robust beamforming; robust spatial beamforming; stationary sources; three-dimensional tracker; time-frequency masking; visual modality;
Journal_Title :
Signal Processing, IET
DOI :
10.1049/iet-spr.2011.0124