DocumentCode
3179626
Title
Learning spoken words from multisensory input
Author
Yu, Chen ; Ballard, Dana H.
Author_Institution
Dept. of Comput. Sci., Rochester Univ., NY, USA
Volume
2
fYear
2002
fDate
26-30 Aug. 2002
Firstpage
998
Abstract
Speech recognition and speech translation are traditionally addressed by processing acoustic signals alone; nonlinguistic information is typically not used. We present a new method that explores spoken word learning from naturally co-occurring multisensory information in a dyadic (two-person) conversation. It has been observed that the listener has a strong tendency to look toward objects referred to by the speaker during the conversation. In light of this, we propose using eye gaze to integrate acoustic and visual signals and to build audio-visual lexicons of objects. With such data gathered from conversations in different languages, the spoken names of objects can be translated across languages based on their visual semantics. We have developed a multimodal learning system and report the results of experiments that use speech and video, in concert with eye movement records, as training data.
Keywords
acoustic signal processing; eye; language translation; natural languages; speech recognition; video signal processing; acoustic signals; audio-visual lexicons; training data; dyadic two-person conversation; eye movement records; multimodal learning system; multisensory information; multisensory input; nonlinguistic information; speech translation; spoken word learning; video processing; visual semantics; visual signals; Authentication; Computer science; Humans; Learning systems; Loudspeakers; Pediatrics; Signal processing; Speech processing
fLanguage
English
Publisher
IEEE
Conference_Titel
Signal Processing, 2002 6th International Conference on
Print_ISBN
0-7803-7488-6
Type
conf
DOI
10.1109/ICOSP.2002.1179956
Filename
1179956