DocumentCode :
3744850
Title :
Deep multimodal semantic embeddings for speech and images
Author :
David Harwath;James Glass
Author_Institution :
MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, Massachusetts 02139, USA
Year :
2015
Firstpage :
237
Lastpage :
244
Abstract :
In this paper, we present a model which takes as input a corpus of images with relevant spoken captions and finds a correspondence between the two modalities. We employ a pair of convolutional neural networks to model visual objects and speech signals at the word level, and tie the networks together with an embedding and alignment model which learns a joint semantic space over both modalities. We evaluate our model using image search and annotation tasks on the Flickr8k dataset, which we augmented by collecting a corpus of 40,000 spoken captions using Amazon Mechanical Turk.
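Illustration (not from the paper): the abstract describes two convolutional branches, one over images and one over speech, projected into a joint semantic space where cross-modal similarity can be scored. The sketch below shows that general structure in PyTorch; all module names, layer sizes, and pooling choices are assumptions for illustration and do not reproduce the authors' architecture.

import torch
import torch.nn as nn

class ImageBranch(nn.Module):
    """Illustrative image CNN mapped into a shared embedding space."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),           # pool to one feature vector
        )
        self.proj = nn.Linear(64, embed_dim)   # project into the joint space

    def forward(self, images):                 # images: (B, 3, H, W)
        feats = self.conv(images).flatten(1)
        return nn.functional.normalize(self.proj(feats), dim=-1)

class SpeechBranch(nn.Module):
    """Illustrative speech CNN over spectrogram frames."""
    def __init__(self, n_mels=40, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),           # pool over the time axis
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, spectrograms):           # spectrograms: (B, n_mels, T)
        feats = self.conv(spectrograms).flatten(1)
        return nn.functional.normalize(self.proj(feats), dim=-1)

# Cross-modal similarity: inner products in the joint space give
# image-caption scores usable for search and annotation ranking.
img_emb = ImageBranch()(torch.randn(2, 3, 224, 224))
spc_emb = SpeechBranch()(torch.randn(2, 40, 1024))
scores = img_emb @ spc_emb.t()                 # (2, 2) score matrix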
Keywords :
"Spectrogram","Semantics","Visualization","Speech","Neural networks","Image segmentation","Natural languages"
Publisher :
IEEE
Conference_Titel :
2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)
Type :
conf
DOI :
10.1109/ASRU.2015.7404800
Filename :
7404800