
Deep Multimodal Semantic Embeddings for Speech and Images

2015-11-11
David Harwath, James Glass

Abstract

In this paper, we present a model which takes as input a corpus of images with relevant spoken captions and finds a correspondence between the two modalities. We employ a pair of convolutional neural networks to model visual objects and speech signals at the word level, and tie the networks together with an embedding and alignment model which learns a joint semantic space over both modalities. We evaluate our model using image search and annotation tasks on the Flickr8k dataset, which we augmented by collecting a corpus of 40,000 spoken captions using Amazon Mechanical Turk.

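Below is a minimal sketch of the two-branch embedding idea the abstract describes: one CNN embeds images, another embeds spectrograms of spoken words, and both are projected into a shared semantic space scored by dot products. This is not the authors' implementation; the framework (PyTorch), layer sizes, spectrogram dimensions, and the max-margin ranking loss are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class ImageBranch(nn.Module):
    """Toy image CNN that maps an RGB image to a unit-norm embedding."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, images):                      # images: (B, 3, H, W)
        h = self.features(images).flatten(1)
        return nn.functional.normalize(self.proj(h), dim=-1)

class SpeechBranch(nn.Module):
    """Toy speech CNN that maps a word-level spectrogram to the same space."""
    def __init__(self, embed_dim=512, n_mel=40):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, (n_mel, 5)), nn.ReLU(),  # filters span the frequency axis
            nn.AdaptiveMaxPool2d((1, 1)),             # pool over time
        )
        self.proj = nn.Linear(32, embed_dim)

    def forward(self, spectrograms):                 # spectrograms: (B, 1, n_mel, T)
        h = self.features(spectrograms).flatten(1)
        return nn.functional.normalize(self.proj(h), dim=-1)

def ranking_loss(img_emb, spk_emb, margin=0.2):
    """Max-margin loss pushing matched image/speech pairs above mismatched ones."""
    scores = img_emb @ spk_emb.t()                   # (B, B) similarity matrix
    pos = scores.diag().unsqueeze(1)                 # matched pairs lie on the diagonal
    cost = (margin + scores - pos).clamp(min=0)
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return cost.masked_fill(mask, 0.0).mean()
```

At retrieval time, the same dot-product scores would rank spoken captions for a given image (annotation) or images for a given caption (search), which is the kind of evaluation the abstract reports on Flickr8k.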

URL

https://arxiv.org/abs/1511.03690

PDF

https://arxiv.org/pdf/1511.03690

