DocumentCode
1493002
Title
Learning Bimodal Structure in Audio–Visual Data
Author
Monaci, Gianluca ; Vandergheynst, Pierre ; Sommer, Friedrich T.
Author_Institution
Redwood Center for Theoretical Neuroscience, University of California, Berkeley, CA, USA
Volume
20
Issue
12
fYear
2009
Firstpage
1898
Lastpage
1910
Abstract
A novel model is presented to learn bimodally informative structures from audio-visual signals. The signal is represented as a sparse sum of audio-visual kernels. Each kernel is a bimodal function consisting of synchronous snippets of an audio waveform and a spatio-temporal visual basis function. To represent an audio-visual signal, the kernels can be positioned independently and arbitrarily in space and time. The proposed algorithm uses unsupervised learning to form dictionaries of bimodal kernels from audio-visual material. The basis functions that emerge during learning capture salient audio-visual data structures. In addition, it is demonstrated that the learned dictionary can be used to locate sources of sound in the movie frame. Specifically, in sequences containing two speakers, the algorithm can robustly localize a speaker even in the presence of severe acoustic and visual distracters.
Keywords
audio-visual systems; data structures; dictionaries; unsupervised learning; audio waveform; audio-visual data structures; audio-visual kernels; audio-visual material; audio-visual signals; learning bimodal structure; spatio-temporal visual basis function; Audio–visual source localization; dictionary learning; matching pursuit (MP); multimodal data processing; sparse representation; Acoustic Stimulation; Algorithms; Artificial Intelligence; Auditory Perception; Computer Simulation; Discrimination Learning; Humans; Learning; Photic Stimulation; Recognition (Psychology); Speech; Visual Perception
fLanguage
English
Journal_Title
Neural Networks, IEEE Transactions on
Publisher
ieee
ISSN
1045-9227
Type
jour
DOI
10.1109/TNN.2009.2032182
Filename
5280184