DocumentCode
873485
Title
Exploring Co-Occurrence Between Speech and Body Movement for Audio-Guided Video Localization
Author
Vajaria, Himanshu ; Sarkar, Sudeep ; Kasturi, Rangachar
Author_Institution
University of South Florida, Tampa, FL
Volume
18
Issue
11
fYear
2008
Firstpage
1608
Lastpage
1617
Abstract
This paper presents a bottom-up approach that combines audio and video to simultaneously locate individual speakers in the video (2D source localization) and segment their speech (speaker diarization) in meetings recorded by a single stationary camera and a single microphone. The novelty lies in using motion information from the entire body, rather than just the face, to perform these tasks, which permits processing nonfrontal views, unlike previous work. Since body movements do not exhibit instantaneous signal-level synchrony with speech, the approach targets long-term co-occurrences between audio and video subspaces. First, temporal clustering of the audio produces a large number of intermediate clusters, each containing speech from only a single speaker. Then, spatial clustering is performed on the video frames of each cluster by a novel eigen-analysis method to find the region of dominant motion. This region is associated with the speech under the assumption that a speaker exhibits more movement than the listeners. Thus, partial diarization and localization results are obtained from the intermediate clusters. Speech from an intermediate cluster is modeled by a mixture of Gaussians, and the speaker's location is represented by an eigen-blob model. In the ensuing iterative clustering stage, the diarization and localization results are progressively refined by merging the closest pair of clusters and updating the models until a stopping criterion is met. Ideally, each final cluster contains all the speech from a single speaker, and the corresponding eigen-blob model localizes the speaker in the image. Experiments conducted on 21 h of real data indicate that the proposed localization approach yields a relative improvement of 40% over mutual-information-based localization and that speaker diarization improves by 16% when visual information is incorporated. The proposed approach requires no training and does not rely on a priori hand/face/person detection.
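The eigen-analysis step described in the abstract can be illustrated with a minimal sketch. Assuming per-pixel motion energy is computed from frame differences within one audio cluster, the leading eigenvector of the frame covariance concentrates on the region of dominant motion (the eigen-blob). The function name `eigen_blob`, the threshold value, and the synthetic data below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of eigen-analysis localization (assumptions, not the
# paper's code): stack per-pixel motion energy over the frames of one
# audio cluster, take the leading right singular vector, and threshold
# it to obtain an "eigen-blob" marking the dominant-motion region.
import numpy as np

def eigen_blob(frames, thresh=0.5):
    """frames: (T, H, W) grayscale video frames from one audio cluster."""
    T, H, W = frames.shape
    # Per-pixel motion energy: absolute temporal differences.
    motion = np.abs(np.diff(frames.astype(np.float64), axis=0))  # (T-1, H, W)
    X = motion.reshape(T - 1, -1)            # rows = frames, cols = pixels
    X -= X.mean(axis=0, keepdims=True)       # center each pixel over time
    # Leading right singular vector of X spans the dominant motion pattern
    # across pixels (top eigenvector of the pixel covariance X^T X).
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    blob = np.abs(vt[0]).reshape(H, W)       # sign-invariant weights
    blob /= blob.max() + 1e-12
    return blob > thresh                     # boolean mask: the eigen-blob

# Synthetic example: one region moves coherently, background is noise.
rng = np.random.default_rng(0)
frames = rng.random((30, 48, 64)) * 0.05                   # faint noise
frames[:, 10:20, 20:30] += rng.random(30)[:, None, None]   # coherent mover
mask = eigen_blob(frames)
print(mask[10:20, 20:30].mean())  # close to 1.0: blob covers the mover
```

Per the abstract, this per-cluster localization is then refined jointly with the Gaussian-mixture speech models during the iterative merging stage; that stage is omitted from the sketch.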
Keywords
audio-visual systems; eigenvalues and eigenfunctions; pattern clustering; video signal processing; 2D source localization; audio-guided video localization; dominant motion; eigen-blob model; eigen-analysis method; motion information; speaker location; speech-body movement co-occurrence; temporal clustering; Audio-visual association; meeting analysis; speaker diarization; speaker localization;
fLanguage
English
Journal_Title
IEEE Transactions on Circuits and Systems for Video Technology
Publisher
IEEE
ISSN
1051-8215
Type
jour
DOI
10.1109/TCSVT.2008.2005602
Filename
4633640