Abstract
In this paper we explore the bi-directional mapping between images and their sentence-based descriptions. We propose learning this mapping using a recurrent neural network. Unlike previous approaches that map both sentences and images to a common embedding, we enable the generation of novel sentences given an image. Using the same model, we can also reconstruct the visual features associated with an image given its visual description. We use a novel recurrent visual memory that automatically learns to remember long-term visual concepts to aid in both sentence generation and visual feature reconstruction. We evaluate our approach on several tasks. These include sentence generation, sentence retrieval, and image retrieval. State-of-the-art results are shown for the task of generating novel image descriptions. When compared to human generated captions, our automatically generated captions are preferred by humans over 19.8% of the time. Results are better than or comparable to state-of-the-art results on the image and sentence retrieval tasks for methods using similar visual features.
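To make the abstract's setup more concrete, the following is a minimal PyTorch sketch of the general idea it describes: a recurrent network whose hidden state carries a "visual memory" vector used both to predict the next word of a caption and to reconstruct visual features. This is not the authors' architecture; the class name, the GRU-based recurrence, and all dimensions are illustrative assumptions.

```python
# Minimal sketch (not the paper's implementation) of a recurrent model that
# couples next-word prediction with reconstruction of visual features via a
# latent "visual memory" vector carried across time steps.
# All names and dimensions below are illustrative assumptions.
import torch
import torch.nn as nn

class CaptionVisualMemoryRNN(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=256,
                 hidden_dim=512, visual_dim=4096, memory_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Hidden state is driven by the current word, the image features,
        # and the running visual memory.
        self.rnn = nn.GRUCell(embed_dim + visual_dim + memory_dim, hidden_dim)
        self.to_memory = nn.Linear(hidden_dim, memory_dim)   # update visual memory
        self.to_word = nn.Linear(hidden_dim, vocab_size)      # predict next word
        self.to_visual = nn.Linear(memory_dim, visual_dim)    # reconstruct visual features

    def forward(self, words, image_feats):
        # words: (batch, seq_len) word indices; image_feats: (batch, visual_dim).
        # For caption generation image_feats would be CNN features of the image;
        # for reconstruction from text alone it could be a zero vector.
        batch, seq_len = words.shape
        h = words.new_zeros(batch, self.rnn.hidden_size, dtype=torch.float)
        mem = words.new_zeros(batch, self.to_memory.out_features, dtype=torch.float)
        word_logits, visual_recon = [], []
        for t in range(seq_len):
            x = torch.cat([self.embed(words[:, t]), image_feats, mem], dim=1)
            h = self.rnn(x, h)
            mem = torch.tanh(self.to_memory(h))       # long-term visual memory
            word_logits.append(self.to_word(h))        # next-word distribution
            visual_recon.append(self.to_visual(mem))   # reconstructed visual features
        return torch.stack(word_logits, 1), torch.stack(visual_recon, 1)

# Toy usage: two captions of length 7 paired with 4096-d visual features.
model = CaptionVisualMemoryRNN()
words = torch.randint(0, 10000, (2, 7))
image_feats = torch.randn(2, 4096)
logits, recon = model(words, image_feats)
print(logits.shape, recon.shape)  # (2, 7, 10000) (2, 7, 4096)
```

In such a setup, the word-prediction head would drive sentence generation while the reconstruction head ties the hidden state back to the visual feature space, loosely mirroring the bi-directional mapping the abstract refers to.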
URL
https://arxiv.org/abs/1411.5654