Title :
Active-speaker detection and localization with microphones and cameras embedded into a robotic head
Author :
Cech, Jan ; Mittal, Ravi ; Deleforge, Antoine ; Sanchez-Riera, Jordi ; Alameda-Pineda, Xavier ; Horaud, Radu
Author_Institution :
Center for Machine Perception, CTU Prague, Prague, Czech Republic
Abstract :
In this paper we present a method for detecting and localizing an active speaker, i.e., a speaker who is emitting sound, by fusing visual reconstruction from a stereoscopic camera pair with sound-source localization from several microphones. Both the cameras and the microphones are embedded in the head of a humanoid robot. The proposed statistical fusion model associates 3D faces of potential speakers with 2D sound directions. The paper makes two contributions: (i) a method that discretizes the two-dimensional space of all possible sound directions and accumulates evidence for each direction by estimating the time difference of arrival (TDOA) over all microphone pairs, so that all microphones are used simultaneously and symmetrically; and (ii) an audio-visual alignment method that maps 3D visual features onto 2D sound directions and onto TDOAs between microphone pairs. This alignment implicitly represents both sensing modalities in a common audio-visual coordinate frame. Using simulated as well as real data, we quantitatively assess the robustness of the method against noise and reverberation, and we compare it with several other methods. Finally, we describe a real-time implementation of the proposed technique on a humanoid head with four embedded microphones and two cameras, which enables natural human-robot interactive behavior.
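Illustrative sketch (not from the paper): the first contribution, a discretized grid of sound directions with evidence accumulated over all microphone pairs, can be pictured with the minimal Python example below. It is a sketch under stated assumptions and not the authors' implementation. The far-field (plane-wave) TDOA model, the Gaussian scoring of TDOA mismatch, the speed of sound, the 5-degree grid, the microphone coordinates in EXAMPLE_MICS, and all function names are hypothetical choices made only for illustration.

```python
import itertools
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature

# Hypothetical, non-coplanar positions (metres) of four head-mounted microphones.
EXAMPLE_MICS = np.array([
    [0.1, 0.1, 0.0],
    [0.1, -0.1, 0.0],
    [-0.1, 0.1, 0.05],
    [-0.1, -0.1, -0.05],
])


def direction_vector(azimuth_deg, elevation_deg):
    """Unit vector pointing from the array toward the source."""
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    return np.array([np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)])


def predicted_tdoas(mics, azimuth_deg, elevation_deg):
    """Far-field TDOA (seconds) for every microphone pair (i, j), with i < j."""
    u = direction_vector(azimuth_deg, elevation_deg)
    pairs = itertools.combinations(range(len(mics)), 2)
    return np.array([(mics[j] - mics[i]) @ u / SPEED_OF_SOUND for i, j in pairs])


def localize(mics, measured_tdoas, step_deg=5, sigma=2e-5):
    """Scan a 2D (azimuth, elevation) grid and return the best-scoring direction.

    Every pair contributes one Gaussian vote, so all microphones are treated
    symmetrically. The Gaussian vote is an assumed scoring rule.
    """
    best, best_score = None, -np.inf
    for az in range(-180, 180, step_deg):
        for el in range(-90, 91, step_deg):
            err = predicted_tdoas(mics, az, el) - measured_tdoas
            score = np.exp(-err ** 2 / (2 * sigma ** 2)).sum()
            if score > best_score:
                best, best_score = (az, el), score
    return best, best_score


if __name__ == "__main__":
    # Noise-free synthetic measurement for a source at azimuth 30, elevation 10.
    tdoas = predicted_tdoas(EXAMPLE_MICS, 30, 10)
    print(localize(EXAMPLE_MICS, tdoas))
```

In this sketch the measured TDOAs are synthesized without noise; with real signals they would have to be estimated per pair (for example by cross-correlation), which is outside the scope of the example.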
Keywords :
audio-visual systems; cameras; estimation theory; human-robot interaction; humanoid robots; image fusion; image reconstruction; microphones; robot vision; speaker recognition; 2D sound directions; 3D speaker faces; 3D visual features; TDOA estimation; active-speaker detection; active-speaker localization; audio-visual alignment method; cameras; human-robot interactive behavior; humanoid robot; microphones; robotic head; sound-source localization; statistical fusion model; stereoscopic camera pair; time difference of arrival estimation; visual reconstruction; Cameras; Microphones; Robot kinematics; Robot vision systems; Three-dimensional displays; Visualization;
Conference_Title :
Humanoid Robots (Humanoids), 2013 13th IEEE-RAS International Conference on
Conference_Location :
Atlanta, GA
Print_ISBN :
978-1-4799-2617-6
DOI :
10.1109/HUMANOIDS.2013.7029977