Regional Information Center for Science and Technology - Utilizing visual cues in robot audition for sound source discrimination in speech-based human-robot communication

Abstract:

Human beings can easily discern, through simple listening, whether an observed acoustic signal is direct speech, reflected speech or noise. Acoustic cues alone are enough for humans to discriminate between these kinds of sound sources, but the task is not straightforward for machines. A robot equipped with a current robot audition mechanism will, in most cases, fail to differentiate direct speech from the other sound sources because acoustic information alone is insufficient for effective discrimination. Robot audition is an important topic in speech-based human-robot communication: it enables the robot to associate an incoming speech signal with the user who produced it. In challenging environments, this task becomes difficult because of reflections of the direct speech signal and background noise sources. To counter this problem, a robot needs a minimum amount of prior information to discriminate the valid speech signal (direct speech) from the contaminants (i.e., speech reflections and background noise sources). Failure to do so leads to false speech-to-speaker association in robot audition and gravely degrades the human-robot communication experience. In this paper we propose using visual cues to augment traditional robot audition, which relies solely on acoustic information. The proposed method significantly improves the accuracy of speech-to-speaker association and machine understanding performance in real environments. Experimental results show that our expanded system is robust in discriminating direct speech from speech reflections and background noise sources.
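
The abstract does not describe how the visual cues are combined with the acoustic information, so the following is only an illustrative sketch of one possible scheme, not the authors' method. It assumes that each localized sound source comes with an estimated azimuth and a speech/non-speech flag from some acoustic front end, and that a vision module reports the azimuths of visible speakers. A speech-like source is accepted as direct speech only if it arrives from a direction in which a speaker is seen. All names (SoundSource, label_sources, tolerance_deg) and the 10-degree tolerance are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SoundSource:
    """One localized sound source (hypothetical acoustic front-end output)."""
    azimuth_deg: float    # estimated direction of arrival, in degrees
    is_speech_like: bool  # assumed acoustic speech/non-speech decision


def angular_difference(a_deg: float, b_deg: float) -> float:
    """Smallest absolute angle between two directions, in [0, 180]."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def label_sources(
    sources: List[SoundSource],
    speaker_azimuths_deg: List[float],
    tolerance_deg: float = 10.0,  # assumed matching tolerance
) -> List[str]:
    """Label each source using the visually observed speaker directions.

    - "noise": the source is not speech-like.
    - "direct_speech": speech-like and aligned with a visible speaker.
    - "rejected_speech": speech-like but no speaker is seen in that
      direction (for example a reflection), so it is not associated
      with any user.
    """
    labels = []
    for src in sources:
        if not src.is_speech_like:
            labels.append("noise")
        elif any(
            angular_difference(src.azimuth_deg, az) <= tolerance_deg
            for az in speaker_azimuths_deg
        ):
            labels.append("direct_speech")
        else:
            labels.append("rejected_speech")
    return labels


# Example: one visible speaker at 28 degrees, three localized sources.
sources = [
    SoundSource(azimuth_deg=30.0, is_speech_like=True),
    SoundSource(azimuth_deg=-120.0, is_speech_like=True),
    SoundSource(azimuth_deg=90.0, is_speech_like=False),
]
labels = label_sources(sources, speaker_azimuths_deg=[28.0])
```

In this sketch the visual observation plays the role of the prior information mentioned in the abstract: the acoustic flag alone cannot separate the first two sources, while the speaker direction can.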