Image Captioning and Visual Question Answering Based on Attributes and External Knowledge

2016-12-16
Qi Wu, Chunhua Shen, Anton van den Hengel, Peng Wang, Anthony Dick

Abstract

Much recent progress in Vision-to-Language problems has been achieved through a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). This approach does not explicitly represent high-level semantic concepts, but rather seeks to progress directly from image features to text. In this paper we first propose a method of incorporating high-level concepts into the successful CNN-RNN approach, and show that it achieves a significant improvement over the state of the art in both image captioning and visual question answering. We further show that the same mechanism can be used to incorporate external knowledge, which is critically important for answering high-level visual questions. Specifically, we design a visual question answering model that combines an internal representation of the content of an image with information extracted from a general knowledge base to answer a broad range of image-based questions. In particular, it allows questions to be asked about the contents of an image even when the image itself does not contain a complete answer. Our final model achieves the best reported results on both image captioning and visual question answering on several benchmark datasets.
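
To make the idea of placing high-level concepts between the CNN and the RNN more concrete, here is a minimal sketch in NumPy. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the architecture, so the plain tanh recurrence, the random weights, the layer sizes, and every name below (predict_attributes, greedy_caption, make_params, W_attr_in, and so on) are hypothetical. The only point it shows is that the decoder sees the image solely through a vector of attribute probabilities rather than through raw image features.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_params(rng, feat_dim, n_attr, hidden, vocab):
    """Random weights for illustration only (hypothetical shapes and names)."""
    s = 0.1
    return {
        "W_attr": rng.normal(scale=s, size=(feat_dim, n_attr)),   # features -> attribute logits
        "b_attr": np.zeros(n_attr),
        "W_attr_in": rng.normal(scale=s, size=(n_attr, hidden)),  # attributes -> first RNN input
        "W_xh": rng.normal(scale=s, size=(hidden, hidden)),
        "W_hh": rng.normal(scale=s, size=(hidden, hidden)),
        "b_h": np.zeros(hidden),
        "embed": rng.normal(scale=s, size=(vocab, hidden)),       # word embeddings
        "W_out": rng.normal(scale=s, size=(hidden, vocab)),       # hidden state -> word logits
    }

def predict_attributes(image_features, W_attr, b_attr):
    """Multi-label prediction: one independent probability per concept."""
    return sigmoid(image_features @ W_attr + b_attr)

def rnn_step(x, h, p):
    """One step of a plain tanh recurrence (an assumption, chosen for brevity)."""
    return np.tanh(x @ p["W_xh"] + h @ p["W_hh"] + p["b_h"])

def greedy_caption(attr_probs, p, max_len=5, end_token=0):
    """Greedy decoding where the attribute vector is the only image-derived input."""
    h = np.zeros(p["W_hh"].shape[0])
    # Feed the projected attribute vector as the first input to the decoder.
    h = rnn_step(attr_probs @ p["W_attr_in"], h, p)
    tokens = []
    for _ in range(max_len):
        token = int(np.argmax(h @ p["W_out"]))
        tokens.append(token)
        if token == end_token:
            break
        h = rnn_step(p["embed"][token], h, p)
    return tokens
```

With trained weights, predict_attributes would play the role of the concept predictor and greedy_caption the role of the language model; with the random weights used here, the emitted token ids are meaningless and only the data flow is of interest.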

URL

https://arxiv.org/abs/1603.02814

PDF

https://arxiv.org/pdf/1603.02814

