Learning spoken words from multisensory input

Author

Yu, Chen ; Ballard, Dana H.

Author_Institution

Dept. of Comput. Sci., Rochester Univ., NY, USA

Volume

2

fYear

2002

fDate

26-30 Aug. 2002

Firstpage

998

Abstract

Speech recognition and speech translation are traditionally addressed by processing acoustic signals alone; nonlinguistic information is typically not used. We present a new method that explores spoken word learning from naturally co-occurring multisensory information in a dyadic (two-person) conversation. It has been observed that the listener has a strong tendency to look toward objects referred to by the speaker during the conversation. In light of this, we propose to use eye gaze to integrate acoustic and visual signals and to build audio-visual lexicons of objects. With such data gathered from conversations in different languages, the spoken names of objects can be translated between languages based on their visual semantics. We have developed a multimodal learning system and report the results of experiments using speech and video, in concert with eye movement records, as training data.
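
The idea of pairing spoken names with gazed-at objects, and then translating through the shared object, can be illustrated with a minimal sketch. This is not the authors' system: it assumes the speech has already been segmented into word tokens and the gaze target has already been resolved to an object label, whereas the abstract describes working from raw acoustic and visual signals. The function names, object labels, and sample data are all hypothetical.

```python
from collections import Counter, defaultdict


def build_lexicon(observations):
    """Map each gazed-at object to the spoken token that co-occurs with it most often.

    observations: iterable of (spoken_token, gazed_object) pairs.
    Assumption: tokens and object labels are already extracted upstream.
    """
    counts = defaultdict(Counter)
    for token, obj in observations:
        counts[obj][token] += 1
    # Keep the most frequent token per object as that object's name.
    return {obj: c.most_common(1)[0][0] for obj, c in counts.items()}


def translate(word, source_lexicon, target_lexicon):
    """Translate a word by looking up the object it names in the other lexicon."""
    for obj, name in source_lexicon.items():
        if name == word and obj in target_lexicon:
            return target_lexicon[obj]
    return None  # no shared visual referent found


# Hypothetical toy data: (spoken token, object fixated at that moment).
english_obs = [
    ("cup", "OBJ_CUP"), ("the", "OBJ_CUP"), ("cup", "OBJ_CUP"),
    ("book", "OBJ_BOOK"), ("book", "OBJ_BOOK"), ("a", "OBJ_BOOK"),
]
german_obs = [
    ("Tasse", "OBJ_CUP"), ("Tasse", "OBJ_CUP"), ("die", "OBJ_CUP"),
    ("Buch", "OBJ_BOOK"), ("Buch", "OBJ_BOOK"), ("das", "OBJ_BOOK"),
]

english_lexicon = build_lexicon(english_obs)
german_lexicon = build_lexicon(german_obs)
print(translate("cup", english_lexicon, german_lexicon))
```

Simple co-occurrence counting is used here only because it is easy to follow; function words such as "the" are outvoted by the object name in this toy data, which would not hold in general.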

Keywords

acoustic signal processing; eye; language translation; natural languages; speech recognition; video signal processing; acoustic signals; audio-visual lexicons; baking data; dyadic two-person conversation; eye movement records; multimodal learning system; multisensory information; multisensory input; nonlinguistic information; speech translation; spoken word learning; video processing; visual semantics; visual signals; Authentication; Computer science; Humans; Learning systems; Loudspeakers; Pediatrics; Signal processing; Speech processing

fLanguage

English

Publisher

IEEE

Conference_Titel

Signal Processing, 2002 6th International Conference on

Print_ISBN

0-7803-7488-6

Type

conf

DOI

10.1109/ICOSP.2002.1179956

Filename

1179956