Abstract
Recent progress on automatic generation of image captions has shown that it is possible to describe the most salient information conveyed by images with accurate and meaningful sentences. In this paper, we propose an image captioning system that exploits the parallel structures between images and sentences. In our model, the process of generating the next word, given the previously generated ones, is aligned with visual perception, where shifts of attention among image regions impose a thread of visual ordering. This alignment characterizes the flow of "abstract meaning", encoding what is semantically shared by both the visual scene and the text description. Our system makes a second modeling contribution by introducing scene-specific contexts that capture higher-level semantic information encoded in an image. These contexts adapt the language model for word generation to specific scene types. We benchmark our system against published results on several popular datasets. We show that using either region-based attention or scene-specific contexts improves over systems lacking those components. Furthermore, combining these two modeling ingredients attains state-of-the-art performance.
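The abstract describes two ingredients: soft attention over image-region features at every word-generation step, and a scene-specific context vector that adapts the language model to the scene type. The sketch below is a minimal illustration of how these two ideas can be wired into an LSTM decoder; it is not the authors' exact architecture, and all layer names, dimensions (e.g. `region_dim=512`, `scene_dim=205`), and the fusion scheme are illustrative assumptions.

```python
# Minimal sketch (assumed architecture, not the paper's exact model):
# region-level soft attention + scene-context conditioning of an LSTM decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionSceneCaptioner(nn.Module):
    def __init__(self, vocab_size, region_dim=512, scene_dim=205,
                 embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Attention MLP: scores each image region given the current hidden state.
        self.att_region = nn.Linear(region_dim, hidden_dim)
        self.att_hidden = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        # Scene context (e.g. scene-category posteriors) is projected and fed to
        # the LSTM at every step so word generation adapts to the scene type.
        self.scene_proj = nn.Linear(scene_dim, hidden_dim)
        self.lstm = nn.LSTMCell(embed_dim + region_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def attend(self, regions, h):
        # regions: (batch, n_regions, region_dim); h: (batch, hidden_dim)
        scores = self.att_score(torch.tanh(
            self.att_region(regions) + self.att_hidden(h).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)        # attention weights over regions
        return (alpha * regions).sum(dim=1)     # attended visual feature

    def forward(self, regions, scene_context, captions):
        batch, steps = captions.shape
        h = regions.new_zeros(batch, self.lstm.hidden_size)
        c = torch.zeros_like(h)
        scene = self.scene_proj(scene_context)
        logits = []
        for t in range(steps):
            visual = self.attend(regions, h)    # attention shifts at each step
            x = torch.cat([self.embed(captions[:, t]), visual, scene], dim=1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)       # (batch, steps, vocab_size)


# Tiny usage example with random tensors standing in for CNN region features
# and a scene classifier's output distribution.
model = AttentionSceneCaptioner(vocab_size=1000)
regions = torch.randn(2, 14, 512)          # 14 region features per image
scene = torch.randn(2, 205).softmax(-1)    # scene-type distribution
tokens = torch.randint(0, 1000, (2, 7))    # previously generated words
print(model(regions, scene, tokens).shape)  # torch.Size([2, 7, 1000])
```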
URL
https://arxiv.org/abs/1506.06272